- Problem Statement
- Solution 1 Fine-tuning Sentence Transformers
- Model Deployment
- Solution 2 Large Language Model
- Build from Source
- Conclusion and Future Work
- Miscellaneous
- Contact
The objective of this project is to develop a Machine Learning model that can detect potential AI use by comparing candidate coding answers to responses generated by AI models (GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo). The model predicts an AI-detection score as a floating-point number ranging from `0` (no AI detected) to `1` (heavy AI use detected). Since the prediction is a continuous value within a range, this is a Regression problem that can be solved in multiple ways. In this problem's context, I will demonstrate two solutions:
- Solution 1: Fine-tuning multiple Sentence Transformers on the given dataset. The best-performing fine-tuned model was deployed to HuggingFace and integrated with a Flask web app.
- Solution 2: Utilizing OpenAI's `GPT-4o` LLM to predict the similarity score. Details are documented in this section.
Please watch the YouTube video presentation of this project, or follow this README file for detailed documentation.
The original dataset contains examples of coding questions, candidate answers, AI-generated answers, and the corresponding AI-detected scores to train and test the model. The goal is for the model to predict AI-detected scores for new, unseen data. The dataset can be found in the `data` directory of this project. The dataset has one directory and one file, as follows:
- `dataset-source-codes` directory: This directory has 63 subdirectories (`source_code_000` ... `source_code_062`). Each of these subdirectories represents a coding question and its respective answers completed by both the candidate and AI. A subdirectory, say `source_code_000`, has the following 8 files:
  - `source_code_000.json`: Contains a coding question in a specific programming language (Java in this case) and metadata related to that question
  - `source_code_000.jav`: Contains the candidate's answer code snippet written in Java
  - `source_code_000_gpt-3.5-turbo_00.jav` and `...01.jav`: These two files have two samples of the respective coding answer completed by GPT-3.5-Turbo
  - `source_code_000_gpt-4_00.jav` and `...01.jav`: These two files have two samples of the respective coding answer completed by GPT-4
  - `source_code_000_gpt-4-turbo_00.jav` and `...01.jav`: These two files have two samples of the respective coding answer completed by GPT-4-Turbo
- `CodeAid Source Codes Labeling.xlsx`: This file maps each candidate answer to its respective AI-generated answers and assigns a plagiarism score. It has 3 columns and 378 rows, as follows:
|     | coding_problem_id | llm_answer_id    | plagiarism_score |
|-----|-------------------|------------------|------------------|
| 1   | source_code_000   | gpt-3.5-turbo_00 | 0                |
| 2   | source_code_000   | gpt-3.5-turbo_01 | 0                |
| 3   | source_code_000   | gpt-4_00         | 0                |
| ... | ...               | ...              | ...              |
| 378 | source_code_062   | gpt-4-turbo_01   | 0.30             |
We need to preprocess the data to build a consistent dataset structure for model training and validation. Preprocessing involves the following steps:
- Load all 63 subdirectories containing the coding questions and answers from the `dataset-source-codes` directory
- Load the `CodeAid Source Codes Labeling.xlsx` file with the plagiarism scores
- Create a new tabular dataset where each row has a coding question, the respective candidate answer, an AI-generated answer, and the associated plagiarism score. This step creates 378 rows, since there are 6 plagiarism/similarity scores per candidate answer (2 samples of AI-generated answers for each of 3 LLM variants, so 63 * 2 * 3 = 378 rows)
- Add two new columns combining the coding question with the candidate answer and the AI-generated answer, respectively (this is necessary for feature extraction purposes)
The preprocessed data, with 6 columns and 378 rows, is as follows:

|     | question | candidate_answer | ai_answer | similarity_score | candidate_combined | ai_combined |
|-----|----------|------------------|-----------|------------------|--------------------|-------------|
| 1   | Write a program to find... | fun findLargestElement... | public class LargestEle... | 0.0 | Question: Write a program... | Question: Write a program... |
| 2   | Write a program to find... | fun findLargestElement... | public class Main {\n ... | 0.0 | Question: Write a program... | Question: Write a program... |
| 3   | Write a program to find... | fun findLargestElement... | public class Main {\n ... | 0.0 | Question: Write a program... | Question: Write a program... |
| ... | ... | ... | ... | ... | ... | ... |
| 378 | Create a PHP script that will... | <?php\nfunction getTop... | <?php\n\n// Function... | 0.3 | Question: Create a PHP script... | Question: Create a PHP script... |
The detailed preprocessing documentation can be found in the `preprocessing.ipynb` Jupyter Notebook or the `preprocessing.py` file. The preprocessed data is saved as the `preprocessed_data.csv` file.
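To illustrate the core of this step, here is a minimal sketch of how the combined columns could be built with pandas. The input records, file paths, and the exact separator used when concatenating question and answer are assumptions for illustration; the actual implementation lives in `scripts/preprocessing.py`.

```python
import pandas as pd

# Hypothetical input: one record per (question, candidate answer, AI answer) pair,
# already read from the dataset-source-codes subdirectories and the labeling spreadsheet.
records = [
    {
        "question": "Write a program to find the largest element...",
        "candidate_answer": "fun findLargestElement(...) { ... }",
        "ai_answer": "public class LargestElement { ... }",
        "similarity_score": 0.0,
    },
    # ... 377 more rows in the real dataset
]

df = pd.DataFrame(records)

# Combine the question with each answer so the model sees the full context
# (the "Question: ..." prefix and separator are illustrative assumptions).
df["candidate_combined"] = "Question: " + df["question"] + "\nAnswer: " + df["candidate_answer"]
df["ai_combined"] = "Question: " + df["question"] + "\nAnswer: " + df["ai_answer"]

df.to_csv("data/preprocessed_data.csv", index=False)
```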
5 different Sentence Transformers were selected for fine-tuning based on their average sentence-encoding performance reported on sbert.net. Model training is where feature extraction happens. The feature extraction process is centered around the specified `SentenceTransformer` model, which is used to encode textual data into dense numerical vectors (embeddings). Following is a detailed explanation of how feature extraction is done:
- **Input Data:**
  - The input to the model consists of two columns: `candidate_combined` (the candidate's answer) and `ai_combined` (the AI-generated answer). These represent the two pieces of text whose similarity will be compared.
  - The `similarity_score` is the label, representing how similar the two pieces of text are, which the model learns to predict during training.
- **Creating Examples for Training:**
  - The line `InputExample(texts=[row['candidate_combined'], row['ai_combined']], label=float(row['similarity_score']))` creates training examples for the model. `texts` is a pair of texts that will be encoded into numerical vectors (embeddings) by the `SentenceTransformer` model. These embeddings represent the features extracted from the text data.
  - These `InputExample`s are then passed into a `DataLoader`, which prepares batches of data for training.
- **SentenceTransformer Model:**
  - The core feature extraction happens when the `SentenceTransformer` is initialized. This model is pre-trained on large corpora and can convert input texts into high-dimensional vectors (embeddings).
  - When the training data is passed through the model, it encodes each text (from both `candidate_combined` and `ai_combined`) into a fixed-size embedding. These embeddings are vector representations of the text that capture semantic meaning, making them suitable for downstream tasks like similarity measurement.
- **Cosine Similarity Loss:**
  - The `CosineSimilarityLoss` is used as the loss function for training. The model learns to minimize the cosine distance between embeddings of semantically similar texts (texts with a higher `similarity_score`) and maximize the distance for dissimilar ones.
  - This process adjusts the model's weights to better encode the features that represent textual similarity.
- **Validation and Evaluation:**
  - For validation, the code prepares examples similarly, but these are used for evaluation instead of training.
  - The `EmbeddingSimilarityEvaluator` computes the similarity between the embeddings of `candidate_combined` and `ai_combined` using their cosine similarity, and compares it with the actual `similarity_score`.
- **How Features Are Encoded:**
  - Each piece of text (both `candidate_combined` and `ai_combined`) is passed through the `SentenceTransformer` model.
  - The model tokenizes the text, then converts it into a dense embedding vector of fixed length. These embeddings encode semantic information about the text.
  - The embeddings are the "features" extracted from the text, which are then used to compute similarity.
The features in this code are the dense embeddings extracted by the `SentenceTransformer` model. These embeddings are used to train the model to learn similarities between pairs of text using the cosine similarity loss function.
All 5 models were trained for 5 epochs with batch sizes varying from 4 to 16. The detailed model training documentation can be found in the `train.ipynb` Jupyter Notebook or the `train.py` file. The fine-tuned model is saved in the `models` directory.
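For reference, here is a minimal sketch of the fine-tuning setup described above, using the `sentence-transformers` library. The train/validation split, warmup steps, and output path are illustrative assumptions; the actual hyperparameters and implementation are in `scripts/train.py`.

```python
import pandas as pd
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

df = pd.read_csv("data/preprocessed_data.csv")
train_df, val_df = df.iloc[:300], df.iloc[300:]  # illustrative split

# Each training example pairs the candidate text with the AI text, labeled by similarity.
train_examples = [
    InputExample(texts=[row["candidate_combined"], row["ai_combined"]],
                 label=float(row["similarity_score"]))
    for _, row in train_df.iterrows()
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = SentenceTransformer("all-distilroberta-v1")
train_loss = losses.CosineSimilarityLoss(model)

# Validation: compare cosine similarity of embeddings against the ground-truth scores.
evaluator = EmbeddingSimilarityEvaluator(
    val_df["candidate_combined"].tolist(),
    val_df["ai_combined"].tolist(),
    val_df["similarity_score"].tolist(),
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=5,
    warmup_steps=10,                                   # assumed value
    output_path="models/all-distilroberta-v1-finetuned",  # assumed path
)
```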
Following are the 8 different metrics employed for evaluating the 5 fine-tuned models:
Metric | all-mpnet-base-v2 | all-distilroberta-v1 | all-MiniLM-L12-v2 | all-MiniLM-L6-v2 | multi-qa-mpnet-base-dot-v1 |
---|---|---|---|---|---|
Cosine Spearman | 0.9508 | 0.9519 | 0.8966 | 0.9 | 0.9672 |
Manhattan Spearman | 0.95 | 0.9477 | 0.8925 | 0.8931 | 0.9603 |
Euclidean Spearman | 0.9508 | 0.9519 | 0.8966 | 0.9 | 0.9551 |
Dot Product Spearman | 0.9508 | 0.9519 | 0.8966 | 0.9 | 0.9652 |
Mean Squared Error | 0.0063 | 0.0056 | 0.0257 | 0.0165 | 0.0086 |
Root Mean Squared Error | 0.0794 | 0.0749 | 0.1602 | 0.1284 | 0.0925 |
Mean Absolute Error | 0.0583 | 0.0534 | 0.0954 | 0.0880 | 0.0702 |
R-squared Score | 0.9119 | 0.9215 | 0.6412 | 0.7696 | 0.8805 |
From the above metrics, we can derive several insights about each model's performance across different evaluation criteria. Let us go through them one by one:
- **all-mpnet-base-v2:**
  - Generally performs well in all metrics, especially the Spearman correlation metrics (Cosine, Manhattan, Euclidean, Dot Product), showing consistency across different similarity measures.
  - Has a low Mean Squared Error (0.0063), indicating good predictive performance.
  - RMSE and MAE are low compared to most other models, and the R-squared score (0.9119) reflects a strong goodness-of-fit.
- **all-distilroberta-v1:**
  - Slightly outperforms all-mpnet-base-v2 in most Spearman correlations, showing excellent alignment with ground-truth similarities.
  - Exhibits the lowest Mean Squared Error (0.0056) and RMSE (0.0749), suggesting this model makes the fewest errors in prediction.
  - It also has the highest R-squared score (0.9215), indicating it captures the most variance and performs very well across the board.
- **all-MiniLM-L12-v2:**
  - Performs relatively poorly in comparison to the other models, with lower Spearman correlations (around 0.89–0.90) and much higher error metrics.
  - Has the highest Mean Squared Error (0.0257), RMSE (0.1602), and MAE (0.0954), showing that this model's predictions are the least accurate.
  - Its R-squared score is the lowest (0.6412), implying a weaker fit to the data.
- **all-MiniLM-L6-v2:**
  - Performs similarly to MiniLM-L12, but with slightly better error metrics, though still worse than most other models.
  - Has moderate Spearman correlations (around 0.89–0.90), but significantly higher errors (MSE = 0.0165, RMSE = 0.1284, MAE = 0.0880) than the top-performing models.
  - Its R-squared score (0.7696) is better than L12's but still indicates room for improvement.
- **multi-qa-mpnet-base-dot-v1:**
  - Shows the highest performance in the Spearman correlation metrics, particularly Cosine (0.9672), Dot Product (0.9652), and Manhattan (0.9603) similarity, suggesting it captures similarity relationships very well.
  - Though it has higher errors (MSE = 0.0086, RMSE = 0.0925) than distilroberta, they are still reasonable.
  - With an R-squared score of 0.8805, it demonstrates strong predictive power and is one of the top-performing models overall.
- all-distilroberta-v1 consistently performs best in terms of error metrics (MSE, RMSE, MAE) and variance explained (R-squared).
- multi-qa-mpnet-base-dot-v1 excels in similarity-based evaluations (Spearman correlations), making it highly effective for tasks requiring strong semantic understanding.
- all-mpnet-base-v2 offers a balanced performance across both error and similarity metrics.
- all-MiniLM-L12-v2 and L6-v2 are the weaker models in this comparison, especially in terms of error metrics and R-squared scores.
The detailed evaluation process can be found in the `evaluation.ipynb` Jupyter Notebook or the `evaluation.py` file.
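As a reference for how these numbers can be reproduced, here is a minimal sketch of computing the correlation and error metrics from predicted vs. ground-truth similarity scores. The model path and validation split are illustrative assumptions; the full procedure is in `scripts/evaluation.py`.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sentence_transformers import SentenceTransformer, util

df = pd.read_csv("data/preprocessed_data.csv").iloc[300:]          # illustrative validation split
model = SentenceTransformer("models/all-distilroberta-v1-finetuned")  # assumed path

# Encode both sides and score each pair with cosine similarity.
emb_candidate = model.encode(df["candidate_combined"].tolist(), convert_to_tensor=True)
emb_ai = model.encode(df["ai_combined"].tolist(), convert_to_tensor=True)
predicted = util.cos_sim(emb_candidate, emb_ai).diagonal().cpu().numpy()
actual = df["similarity_score"].to_numpy()

rho, _ = spearmanr(actual, predicted)
mse = mean_squared_error(actual, predicted)
print("Cosine Spearman:", rho)
print("MSE:", mse)
print("RMSE:", np.sqrt(mse))
print("MAE:", mean_absolute_error(actual, predicted))
print("R-squared:", r2_score(actual, predicted))
```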
From the model evaluation, it is evident that, overall, the fine-tuned `all-distilroberta-v1` model is relatively the best-performing model, so I decided to deploy it.
For deployment, we need to convert this model to ONNX (Open Neural Network Exchange) format. ONNX optimizes performance during inference and is supported by a variety of hardware, such as CPUs and GPUs, making it ideal for real-world deployment. Exporting models to ONNX simplifies deployment and ensures models can run efficiently across different systems. It also enables cross-platform compatibility, allowing models trained in one framework to be deployed in another.
We must also quantize ONNX models before deploying. Quantizing reduces their size, speeds up inference, and lowers memory usage, all with minimal loss of accuracy. This combination of ONNX and quantization is particularly significant for deploying models on resource-constrained platforms.
The detailed model export and compression documentation can be found in the `export.ipynb` Jupyter Notebook or the `export.py` file. The exported and quantized model is saved in the `models` directory.
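Below is a minimal sketch of one possible way to export the fine-tuned model to ONNX and apply dynamic quantization, using Hugging Face `optimum` with ONNX Runtime. The model path, output directories, and quantization configuration are assumptions for illustration and are not necessarily the exact approach used in `scripts/export.py`.

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer

model_dir = "models/all-distilroberta-v1-finetuned"  # assumed path to the fine-tuned model

# Export the transformer backbone to ONNX.
ort_model = ORTModelForFeatureExtraction.from_pretrained(model_dir, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
ort_model.save_pretrained("models/all-distilroberta-v1-onnx")
tokenizer.save_pretrained("models/all-distilroberta-v1-onnx")

# Apply dynamic (weight-only) INT8 quantization to shrink the model and speed up CPU inference.
quantizer = ORTQuantizer.from_pretrained(ort_model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(
    save_dir="models/all-distilroberta-v1-onnx-quantized",
    quantization_config=qconfig,
)
```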
The quantized `all-distilroberta-v1` model was deployed to HuggingFace Spaces with Gradio. Following is a snapshot of the HuggingFace app.
The HuggingFace deployment source code can be found in the `huggingface` directory of this repository.
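Purely as an illustration, a Gradio app for such a model could look roughly like the sketch below: a prediction function that embeds the two texts and returns their cosine similarity, wrapped in a simple interface. For simplicity this sketch loads the model with `sentence-transformers` rather than the quantized ONNX export, and the labels and model path are assumptions; the actual app code is in the `huggingface` directory.

```python
import gradio as gr
from sentence_transformers import SentenceTransformer, util

# Assumed path; the deployed Space loads the fine-tuned model from its repository.
model = SentenceTransformer("models/all-distilroberta-v1-finetuned")

def predict_similarity(candidate_text: str, ai_text: str) -> float:
    """Embed both answers and return their cosine similarity as the AI-detection score."""
    embeddings = model.encode([candidate_text, ai_text], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]).item())

demo = gr.Interface(
    fn=predict_similarity,
    inputs=[gr.Textbox(label="Candidate answer"), gr.Textbox(label="AI-generated answer")],
    outputs=gr.Number(label="Similarity score"),
    title="AI Detector",
)

if __name__ == "__main__":
    demo.launch()
```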
A custom Web Application was developed with Flask to demonstrate the AI detection model's capability. It uses the HuggingFace Spaces API in the backend. Following are some snapshots of the Flask webapp:
The Flask web app deployment source code can be found in the `flask_app` directory of this repository.
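Here is a minimal sketch of how a Flask backend can call the HuggingFace Space via the `gradio_client` package. The Space ID, endpoint name, and route below are placeholders, not the project's actual values; see the `flask_app` directory for the real integration.

```python
from flask import Flask, request, jsonify
from gradio_client import Client

app = Flask(__name__)

# Placeholder Space ID and API endpoint; check the Space's "Use via API" page for real values.
space = Client("username/ai-detector")

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    score = space.predict(
        data["candidate_answer"],
        data["ai_answer"],
        api_name="/predict",
    )
    return jsonify({"similarity_score": score})

if __name__ == "__main__":
    app.run(debug=True)
```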
This method uses the Natural Language Processing (NLP) capabilities of OpenAI's `GPT-4o` to understand the context of the coding question and both the candidate's answer and the AI-generated answer. It performs a semantic comparison to predict how closely the answers match based on the meaning and structure of the code. Here's how it works:
The code loads three inputs from text files:
- `question.txt` (the coding question)
- `candidate_answer.txt` (the human-written answer)
- `ai_answer.txt` (the AI-generated answer)

These files contain the necessary test text inputs for the evaluation. Modify the contents of these files for testing on your own data.
- The system prompt is used to instruct `GPT-4o` on its role as a "code similarity evaluator." It defines the task as comparing two answers and returning a similarity score as a floating-point number between 0 and 1.
- The human prompt consists of the actual coding question, the candidate's answer, and the AI-generated answer. `GPT-4o` uses these inputs to assess the similarity.
- The text inputs are processed using LangChain's `RunnableSequence`, combining the system prompt and user inputs. `GPT-4o` then evaluates how similar the candidate's answer is to the AI-generated answer and produces a floating-point similarity score between 0 (completely different) and 1 (exact match).
- The `GPT-4o` model's output, which is an `AIMessage` object, contains the similarity score.
- The script extracts the score from the response and prints it as the final result.
The detailed source code can be found in the `llm` directory of this repository.
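For illustration, a LangChain pipeline along these lines could look like the sketch below. The exact prompt wording and file paths mirror the description above but are assumptions; the actual script is `llm/main.py`.

```python
from pathlib import Path
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Load the three test inputs described above (paths assumed to be inside the llm directory).
question = Path("llm/question.txt").read_text()
candidate_answer = Path("llm/candidate_answer.txt").read_text()
ai_answer = Path("llm/ai_answer.txt").read_text()

# System prompt: define the "code similarity evaluator" role and the expected output format.
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a code similarity evaluator. Compare the candidate's answer with the "
     "AI-generated answer for the given coding question and respond with only a "
     "floating-point similarity score between 0 (completely different) and 1 (exact match)."),
    ("human",
     "Question:\n{question}\n\nCandidate answer:\n{candidate_answer}\n\n"
     "AI-generated answer:\n{ai_answer}"),
])

llm = ChatOpenAI(model="gpt-4o", temperature=0)  # reads OPENAI_API_KEY from the environment
chain = prompt | llm  # composing prompt and model yields a RunnableSequence

response = chain.invoke({
    "question": question,
    "candidate_answer": candidate_answer,
    "ai_answer": ai_answer,
})
print(float(response.content))  # response is an AIMessage; its content holds the score
```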
- Clone the repo
git clone https://github.com/zzarif/AI-Detector.git
cd AI-Detector/
- Initialize and activate virtual environment
virtualenv venv
source venv/Scripts/activate
- Install dependencies
pip install -r requirements.txt
Note: Select the virtual environment interpreter from `Ctrl`+`Shift`+`P`
- Preprocess the Data
Run all the cells in the `preprocessing.ipynb` Jupyter Notebook or run the following script:
python scripts/preprocessing.py
- Train the model
Run all the cells in the `train.ipynb` Jupyter Notebook or run the following script (with the specified model and hyperparameters):
python scripts/train.py --model all-MiniLM-L6-v2 --epochs 5 --batch_size 16
- Evaluate the model
Run all the cells in the `evaluation.ipynb` Jupyter Notebook or run the following script (with the specified fine-tuned model):
python scripts/evaluation.py --ft_model all-MiniLM-L6-v2
- Perform model inference
Run all the cells in the `inference.ipynb` Jupyter Notebook or run the following script (with the specified fine-tuned model):
python scripts/inference.py --ft_model all-MiniLM-L6-v2
- Export model for deployment
Run all the cells in the `export.ipynb` Jupyter Notebook or run the following script (with the specified fine-tuned model):
python scripts/export.py --ft_model all-MiniLM-L6-v2
- Predict similarity with LLM
Create a `.env` file in the `llm` directory and insert the following line:
OPENAI_API_KEY=<your-openai-api-key>
Note: Replace `<your-openai-api-key>` with your own API key from the OpenAI API Keys page.
Then run the following script:
python llm/main.py
Note: Modify the contents of the `.txt` files to test the model's capability on your own data.
In this comprehensive project, I dealt with a similarity-prediction Regression task. I had to develop a Machine Learning model that can detect potential AI use by comparing candidate coding answers to responses generated by AI models (GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo). The model can predict an AI-detection score as a floating-point number ranging from `0` (no AI detected) to `1` (heavy AI use detected).
In this project's context, I presented two different solutions for solving the task at hand. The first solution was fine-tuning multiple Sentence Transformers on the provided dataset. From the model evaluation, it was evident that all of the models performed significantly well on the validation dataset. Since the fine-tuned `all-distilroberta-v1` had relatively the best performance among the models, it was exported and quantized for deployment. The final quantized model was deployed to HuggingFace and integrated with a custom Flask web app.
In the second solution, I used the Natural Language Processing (NLP) capabilities of OpenAI's `GPT-4o` to understand the context of the coding question and both the candidate's answer and the AI-generated answer. It performs a semantic comparison to predict how closely the answers match based on the meaning and structure of the code.
Due to time constraints, I couldn't experiment further. For instance, I deployed the fine-tuned `all-distilroberta-v1` since it had relatively the best performance overall, but models like `all-mpnet-base-v2` and `multi-qa-mpnet-base-dot-v1` were fairly good candidates for deployment as well. I will export, quantize, and deploy these models as future work. I could also employ more advanced techniques to solve the task at hand. For instance, I could develop another `SimilarityPredictor` on top of the SBERT model that finds a pattern between a candidate's coding answer and its respective similarity prediction. This could potentially allow the `SimilarityPredictor` to predict the similarity score directly from the candidate's coding answer, without needing an equivalent AI-generated answer as a reference (a rough sketch of this idea follows below). I will add this to my future work as well.
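Purely as an illustration of that future-work idea, such a `SimilarityPredictor` might be a small regression head on top of SBERT embeddings of the candidate answer alone. Everything below (class name, layer sizes, input format) is a hypothetical sketch, not part of the current project.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class SimilarityPredictor(nn.Module):
    """Hypothetical head that maps a candidate answer's embedding to a similarity score."""

    def __init__(self, encoder_name: str = "all-distilroberta-v1"):
        super().__init__()
        self.encoder = SentenceTransformer(encoder_name)
        dim = self.encoder.get_sentence_embedding_dimension()
        self.head = nn.Sequential(
            nn.Linear(dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid(),  # keep the predicted score in [0, 1]
        )

    def forward(self, texts: list[str]) -> torch.Tensor:
        embeddings = self.encoder.encode(texts, convert_to_tensor=True)
        return self.head(embeddings).squeeze(-1)

# Usage sketch: predict a score directly from the candidate's combined question + answer text.
predictor = SimilarityPredictor()
with torch.no_grad():
    score = predictor(["Question: Write a program to find...\nAnswer: fun findLargestElement..."])
print(score.item())
```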
Finally, I used a commercial LLM (OpenAI's `GPT-4o`) for my second solution. Due to time constraints, I couldn't further experiment with open-source options like `Llama3`, `Llama2`, `Mistral3`, etc.
The `utils` directory contains some helper scripts. For example, the `file_extension_resolver.py` script parses the original dataset and creates the programming-language-to-extension mapping dictionary used in the `preprocessing.py` file.
Thank you so much for your interest. Would love your valuable feedback!