AI Detector

Detect AI generated coding answers

Table of Contents

  1. Problem Statement
  2. Solution 1: Fine-tuning Sentence Transformers
  3. Model Deployment
  4. Solution 2: Large Language Model
  5. Build from Source
  6. Conclusion and Future Work
  7. Miscellaneous
  8. Contact

Problem Statement

The objective of this project is to develop a Machine Learning model that detects potential AI use by comparing candidate coding answers to responses generated by AI models (GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo). The model predicts an AI-detected score as a floating-point number from 0 (no AI detected) to 1 (heavily AI-detected). Since the prediction is a continuous value in a range, this is a Regression problem which can be solved in multiple ways. In this problem's context, I will demonstrate two solutions:

  • Solution 1: Fine-tuning multiple Sentence Transformers on the given dataset. The best-performing fine-tuned model was deployed to HuggingFace and integrated with a Flask web app.
  • Solution 2: Utilizing OpenAI's GPT-4o LLM to predict the similarity score. Details are documented in the corresponding section below.

Please watch the YouTube video presentation of this project, or follow this README file for detailed documentation.

Solution 1: Fine-tuning Sentence Transformers

Data Understanding

The original dataset contains examples of coding questions, candidate answers, AI-generated answers, and the corresponding AI-detected scores to train and test the model. The goal is for the model to predict AI-detected scores for new, unseen data. The dataset can be found in the data directory of this project. The dataset has one directory and one file as follows:

  1. dataset-source-codes directory: This directory has 63 subdirectories (source_code_000 ... source_code_062). Each subdirectory represents a coding question and its respective answers, completed by both the candidate and AI. A subdirectory, say source_code_000, has the following 8 files:
  • source_code_000.json: Contains a coding question in a specific programming language (Java in this case) and metadata related to that question
  • source_code_000.jav: Contains the candidate's answer code snippet written in Java
  • source_code_000_gpt-3.5-turbo_00.jav and ...01.jav: Two samples of the respective coding answer completed by GPT-3.5 Turbo
  • source_code_000_gpt-4_00.jav and ...01.jav: Two samples of the respective coding answer completed by GPT-4
  • source_code_000_gpt-4-turbo_00.jav and ...01.jav: Two samples of the respective coding answer completed by GPT-4 Turbo
  2. CodeAid Source Codes Labeling.xlsx: This file maps a candidate's answer to its respective AI-generated answers and assigns a plagiarism score. It has 3 columns and 378 rows as follows:

| # | coding_problem_id | llm_answer_id | plagiarism_score |
|---|---|---|---|
| 1 | source_code_000 | gpt-3.5-turbo_00 | 0 |
| 2 | source_code_000 | gpt-3.5-turbo_01 | 0 |
| 3 | source_code_000 | gpt-4_00 | 0 |
| ... | ... | ... | ... |
| 378 | source_code_062 | gpt-4-turbo_01 | 0.30 |

Data Preprocessing

We need to preprocess the data to build a consistent dataset structure for the model training and validation. Preprocessing involves the following steps:

  1. Load all 63 subdirectories containing the coding questions and answers from the dataset-source-codes directory
  2. Load CodeAid Source Codes Labeling.xlsx file with plagiarism scores
  3. Create a new tabular dataset in which each row contains a coding question, the respective candidate answer, one AI-generated answer, and the associated plagiarism score. Since each coding question has 6 AI-generated answers (2 samples from each of 3 LLM variants), this step creates 63 * 2 * 3 = 378 rows
  4. Add two new columns combining the coding question with the candidate answer and the AI-generated answer, respectively (this is necessary for feature extraction purposes)

The preprocessed data with 6 columns and 378 rows is as follows:

| # | question | candidate_answer | ai_answer | similarity_score | candidate_combined | ai_combined |
|---|---|---|---|---|---|---|
| 1 | Write a program to find... | fun findLargestElement... | public class LargestEle... | 0.0 | Question: Write a program... | Question: Write a program... |
| 2 | Write a program to find... | fun findLargestElement... | public class Main {\n ... | 0.0 | Question: Write a program... | Question: Write a program... |
| 3 | Write a program to find... | fun findLargestElement... | public class Main {\n ... | 0.0 | Question: Write a program... | Question: Write a program... |
| ... | ... | ... | ... | ... | ... | ... |
| 378 | Create a PHP script that will... | <?php\nfunction getTop... | <?php\n\n// Function... | 0.3 | Question: Create a PHP script... | Question: Create a PHP script... |

The detailed preprocessing documentation can be found in preprocessing.ipynb Jupyter Notebook or preprocessing.py file. The preprocessed data is saved as preprocessed_data.csv file.
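
To make the steps above concrete, here is a condensed sketch of the preprocessing logic. The file paths, the JSON key holding the question text, the "Question: ... Answer: ..." template, and the extension-lookup helper are assumptions rather than the repository's exact code; see preprocessing.py for the actual implementation.

```python
# Condensed sketch of the preprocessing steps described above.
import json
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data/dataset-source-codes")                 # paths are assumptions
LABELS_XLSX = Path("data/CodeAid Source Codes Labeling.xlsx")


def build_dataset(ext_map: dict[str, str]) -> pd.DataFrame:
    """ext_map maps a coding_problem_id to its source-file extension, e.g. 'jav'."""
    rows = []
    for _, r in pd.read_excel(LABELS_XLSX).iterrows():
        pid, llm_id = r["coding_problem_id"], r["llm_answer_id"]
        pdir, ext = DATA_DIR / pid, ext_map[pid]

        meta = json.loads((pdir / f"{pid}.json").read_text())
        question = meta["question"]                          # JSON key is an assumption
        candidate = (pdir / f"{pid}.{ext}").read_text()
        ai_answer = (pdir / f"{pid}_{llm_id}.{ext}").read_text()

        rows.append({
            "question": question,
            "candidate_answer": candidate,
            "ai_answer": ai_answer,
            "similarity_score": r["plagiarism_score"],
            "candidate_combined": f"Question: {question}\nAnswer: {candidate}",
            "ai_combined": f"Question: {question}\nAnswer: {ai_answer}",
        })
    return pd.DataFrame(rows)


# df = build_dataset(ext_map)          # 378 rows, 6 columns
# df.to_csv("data/preprocessed_data.csv", index=False)
```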

Model Training

Five different Sentence Transformers were selected for fine-tuning based on their average sentence-encoding performance reported on sbert.net. Model training is where feature extraction happens. Feature extraction is centered around the specified SentenceTransformer model, which encodes textual data into dense numerical vectors (embeddings). Following is a detailed explanation of how feature extraction is done:

  1. Input Data:

    • The input to the model consists of two columns: candidate_combined (the candidate's answer) and ai_combined (the AI-generated answer). These represent the two pieces of text whose similarity will be compared.
    • The similarity_score is the label, representing how similar the two pieces of text are, which the model learns to predict during training.
  2. Creating Examples for Training:

    • The line InputExample(texts=[row['candidate_combined'], row['ai_combined']], label=float(row['similarity_score'])) creates training examples for the model.
    • texts is a pair of texts that will be encoded into numerical vectors (embeddings) by the SentenceTransformer model. These embeddings represent the features extracted from the text data.
    • These InputExamples are then passed into a DataLoader, which prepares batches of data for training.
  3. SentenceTransformer Model:

    • The core feature extraction happens inside the SentenceTransformer model, which is pre-trained on large corpora and converts input texts into high-dimensional vectors (embeddings).
    • When the training data is passed through the model, it encodes each text (from both candidate_combined and ai_combined) into a fixed-size embedding. These embeddings are vector representations of the text that capture semantic meaning, making them suitable for downstream tasks like similarity measurement.
  4. Cosine Similarity Loss:

    • The CosineSimilarityLoss is used as the loss function for training: the model is optimized so that the cosine similarity between the two embeddings matches the labeled similarity_score, pulling embeddings of high-scoring pairs closer together and pushing low-scoring pairs apart.
    • This process adjusts the model's weights to better encode the features that represent textual similarity.
  5. Validation and Evaluation:

    • For validation, the code prepares examples similarly, but these are used for evaluation instead of training.
    • The EmbeddingSimilarityEvaluator computes the similarity between the embeddings of candidate_combined and ai_combined using their cosine similarity, and compares it with the actual similarity_score.
  6. How Features Are Encoded:

    • Each piece of text (both candidate_combined and ai_combined) is passed through the SentenceTransformer model.
    • The model tokenizes the text, then converts it into a dense embedding vector of fixed length. These embeddings encode semantic information about the text.
    • The embeddings are the "features" extracted from the text, which are then used to compute similarity.

The features in this code are the dense embeddings extracted by the SentenceTransformer model. These embeddings are used to train the model to learn similarities between pairs of text using the cosine similarity loss function.

All 5 models were trained for 5 epochs with batch sizes varying from 4 to 16; a condensed sketch of this setup is shown below. The detailed model training documentation can be found in the train.ipynb Jupyter Notebook or the train.py file. The fine-tuned models are saved in the models directory.
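
The following is a minimal sketch of fine-tuning one Sentence Transformer with CosineSimilarityLoss, using the sentence-transformers fit API. The train/validation split and output path are illustrative assumptions.

```python
# Minimal fine-tuning sketch: encode candidate/AI answer pairs and train with
# CosineSimilarityLoss so predicted cosine similarity matches similarity_score.
import pandas as pd
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

df = pd.read_csv("data/preprocessed_data.csv")
train_df, val_df = df.iloc[:300], df.iloc[300:]   # illustrative split

train_examples = [
    InputExample(texts=[row["candidate_combined"], row["ai_combined"]],
                 label=float(row["similarity_score"]))
    for _, row in train_df.iterrows()
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = SentenceTransformer("all-distilroberta-v1")
train_loss = losses.CosineSimilarityLoss(model)

evaluator = EmbeddingSimilarityEvaluator(
    val_df["candidate_combined"].tolist(),
    val_df["ai_combined"].tolist(),
    val_df["similarity_score"].tolist(),
)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    evaluator=evaluator,
    epochs=5,
    warmup_steps=100,
    output_path="models/all-distilroberta-v1-finetuned",  # path is an assumption
)
```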

Model Evaluation

Following are the 8 different metrics employed for evaluating the 5 fine-tuned models:

| Metric | all-mpnet-base-v2 | all-distilroberta-v1 | all-MiniLM-L12-v2 | all-MiniLM-L6-v2 | multi-qa-mpnet-base-dot-v1 |
|---|---|---|---|---|---|
| Cosine Spearman | 0.9508 | 0.9519 | 0.8966 | 0.9 | 0.9672 |
| Manhattan Spearman | 0.95 | 0.9477 | 0.8925 | 0.8931 | 0.9603 |
| Euclidean Spearman | 0.9508 | 0.9519 | 0.8966 | 0.9 | 0.9551 |
| Dot Product Spearman | 0.9508 | 0.9519 | 0.8966 | 0.9 | 0.9652 |
| Mean Squared Error | 0.0063 | 0.0056 | 0.0257 | 0.0165 | 0.0086 |
| Root Mean Squared Error | 0.0794 | 0.0749 | 0.1602 | 0.1284 | 0.0925 |
| Mean Absolute Error | 0.0583 | 0.0534 | 0.0954 | 0.0880 | 0.0702 |
| R-squared Score | 0.9119 | 0.9215 | 0.6412 | 0.7696 | 0.8805 |

From the above metrics, we can derive several insights about the performance of each model across different evaluation criteria. Let us go through them one by one:

  1. all-mpnet-base-v2:

    • Generally performs well in all metrics, especially with Spearman correlation metrics (Cosine, Manhattan, Euclidean, Dot Product), showing consistency across different similarity measures.
    • Has a low Mean Squared Error (0.0063), indicating good predictive performance.
    • RMSE and MAE are low compared to most other models, and the R-squared score (0.9119) reflects a strong goodness-of-fit.
  2. all-distilroberta-v1:

    • Slightly outperforms all-mpnet-base-v2 in most Spearman correlations, showing excellent alignment with ground-truth similarities.
    • Exhibits the lowest Mean Squared Error (0.0056) and RMSE (0.0749), suggesting this model makes fewer errors in prediction.
    • It also has the highest R-squared score (0.9215), indicating it captures the most variance and performs very well across the board.
  3. all-MiniLM-L12-v2:

    • Performs relatively poorly in comparison to the other models, with lower Spearman correlations (around 0.89–0.90) and much higher error metrics.
    • Has the highest Mean Squared Error (0.0257), RMSE (0.1602), and MAE (0.0954), showing that this model's predictions are less accurate.
    • Its R-squared score is the lowest (0.6412), implying a weaker fit to the data.
  4. all-MiniLM-L6-v2:

    • Performs similarly to MiniLM-L12, but with slightly better error metrics, though still worse than most other models.
    • Has moderate Spearman correlations (around 0.89–0.9), but significantly higher errors (MSE = 0.0165, RMSE = 0.1284, MAE = 0.0880) than top-performing models.
    • Its R-squared score (0.7696) is better than L12 but still indicates room for improvement.
  5. multi-qa-mpnet-base-dot-v1:

    • This model shows the highest performance in Spearman correlation metrics, particularly in Cosine (0.9672), Dot Product (0.9652), and Manhattan (0.9603) similarity, suggesting it captures similarity relationships very well.
    • Though it has higher errors (MSE = 0.0086, RMSE = 0.0925) than distilroberta, they are still reasonable.
    • With an R-squared score of 0.8805, it demonstrates strong predictive power and is one of the top-performing models overall.

Summary:

  • all-distilroberta-v1 consistently performs best in terms of error metrics (MSE, RMSE, MAE) and variance explained (R-squared).
  • multi-qa-mpnet-base-dot-v1 excels in similarity-based evaluations (Spearman correlations), making it highly effective for tasks requiring strong semantic understanding.
  • all-mpnet-base-v2 offers a balanced performance across both error and similarity metrics.
  • all-MiniLM-L12-v2 and L6-v2 are the weaker models in this comparison, especially in terms of error metrics and R-squared scores.

The detailed evaluation process can be found in evaluation.ipynb Jupyter Notebook or evaluation.py file.
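
For reference, here is a sketch of how metrics like those in the table above can be computed from a fine-tuned model. The model path and validation split are assumptions; the repository's evaluation.py may differ in detail.

```python
# Sketch: score held-out candidate/AI pairs with a fine-tuned model and compute
# Spearman correlation plus the regression error metrics reported above.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sentence_transformers import SentenceTransformer, util

val_df = pd.read_csv("data/preprocessed_data.csv").iloc[300:]          # illustrative split
model = SentenceTransformer("models/all-distilroberta-v1-finetuned")   # path is an assumption

emb_cand = model.encode(val_df["candidate_combined"].tolist(), convert_to_tensor=True)
emb_ai = model.encode(val_df["ai_combined"].tolist(), convert_to_tensor=True)

# Cosine similarity of each candidate/AI pair is the predicted score.
pred = util.cos_sim(emb_cand, emb_ai).diagonal().cpu().numpy()
true = val_df["similarity_score"].to_numpy()

rho, _ = spearmanr(pred, true)
mse = mean_squared_error(true, pred)
print("Cosine Spearman:", rho)
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("MAE: ", mean_absolute_error(true, pred))
print("R^2: ", r2_score(true, pred))
```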

Model Deployment

Model Export and Compression

From the model evaluation, the fine-tuned all-distilroberta-v1 model is the best performer overall, so I decided to deploy it.

For deployment, we need to convert this model to ONNX (Open Neural Network Exchange) format. ONNX optimizes performance during inference and is supported by a variety of hardware, such as CPUs and GPUs, making it ideal for real-world deployment. Exporting models to ONNX simplifies deployment and ensures models can run efficiently across different systems. It also enables cross-platform compatibility, allowing models trained in one framework to be deployed in another.

The ONNX model is also quantized before deployment. Quantization reduces model size, speeds up inference, and lowers memory usage, all with minimal loss of accuracy. This combination of ONNX export and quantization is particularly significant for deploying models on resource-constrained platforms.

The detailed model export and compression documentation can be found in export.ipynb Jupyter Notebook or export.py file. The exported and quantized model is saved at models directory.
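
A minimal sketch of such an export-and-quantize flow is shown below, assuming Hugging Face Optimum is used for the ONNX export and onnxruntime's dynamic quantizer for INT8 compression. The tooling and paths are assumptions; the repository's export.py may differ in detail.

```python
# Export the fine-tuned model's transformer backbone to ONNX, then apply
# dynamic INT8 quantization to the exported graph. Paths are assumptions.
from optimum.onnxruntime import ORTModelForFeatureExtraction
from onnxruntime.quantization import QuantType, quantize_dynamic

ort_model = ORTModelForFeatureExtraction.from_pretrained(
    "models/all-distilroberta-v1-finetuned", export=True
)
ort_model.save_pretrained("models/all-distilroberta-v1-onnx")   # writes model.onnx

quantize_dynamic(
    model_input="models/all-distilroberta-v1-onnx/model.onnx",
    model_output="models/all-distilroberta-v1-onnx/model_quantized.onnx",
    weight_type=QuantType.QInt8,
)
```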

HuggingFace Deployment

The quantized all-distilroberta-v1 model was deployed to HuggingFace Spaces with Gradio. Following is a snapshot of the HuggingFace app.

HF deployment

The HuggingFace deployment source code can be found in the huggingface directory of this repository.
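
For illustration, a minimal Gradio app along these lines might look as follows. The interface layout and scoring function are assumptions, and a plain fine-tuned checkpoint is used here instead of the quantized ONNX model to keep the sketch short.

```python
# Minimal Gradio sketch of the Space's interface.
import gradio as gr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("models/all-distilroberta-v1-finetuned")  # path is an assumption

def detect(question: str, candidate_answer: str, ai_answer: str) -> float:
    cand = model.encode(f"Question: {question}\nAnswer: {candidate_answer}")
    ai = model.encode(f"Question: {question}\nAnswer: {ai_answer}")
    return float(util.cos_sim(cand, ai))   # cosine similarity used as the AI-detected score

demo = gr.Interface(
    fn=detect,
    inputs=[
        gr.Textbox(label="Coding question"),
        gr.Textbox(label="Candidate answer", lines=8),
        gr.Textbox(label="AI-generated answer", lines=8),
    ],
    outputs=gr.Number(label="AI-detected score"),
)

if __name__ == "__main__":
    demo.launch()
```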

Flask Web Deployment

A custom Web Application was developed with Flask to demonstrate the AI detection model's capability. It uses the HuggingFace Spaces API in the backend. Following are some snapshots of the Flask webapp:

Flask Web App

Flask Web App

The Flask web app deployment source code can be found in the flask_app directory of this repository.
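
A minimal sketch of such a backend, calling the Space through gradio_client, is shown below. The Space id, endpoint name, and JSON field names are placeholders, not the repository's actual values.

```python
# Minimal Flask backend sketch that forwards requests to the HuggingFace Space.
from flask import Flask, jsonify, request
from gradio_client import Client

app = Flask(__name__)
client = Client("<username>/<space-name>")      # hypothetical Space id

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()
    result = client.predict(
        payload["question"],
        payload["candidate_answer"],
        payload["ai_answer"],
        api_name="/predict",                    # Gradio's default endpoint name
    )
    return jsonify({"ai_detected_score": result})

if __name__ == "__main__":
    app.run(debug=True)
```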

Solution 2: Large Language Model

This method uses the Natural Language Processing (NLP) capabilities of OpenAI's GPT-4o to understand the context of the coding question together with both the candidate's answer and the AI-generated answer. It performs a semantic comparison to predict how closely the answers match based on the meaning and structure of the code. Here's how it works:

Input Handling

The code loads three inputs from text files: the coding question, the candidate's answer, and the AI-generated answer. These files contain the test inputs for the evaluation; modify their contents to test on your own data.

System and Human Prompts

  • The system prompt is used to instruct GPT-4o on its role as a "code similarity evaluator." It defines the task as comparing two answers and returning a similarity score as a floating point number between 0 and 1.
  • The human prompt consists of the actual coding question, candidate's answer, and AI-generated answer. GPT-4o uses these inputs to assess the similarity.

GPT-4o Prediction

  • The text inputs are processed using LangChain's RunnableSequence, combining the system prompt and user inputs.
  • GPT-4o then evaluates how similar the candidate’s answer is to the AI-generated answer and produces a floating-point similarity score between 0 (completely different) and 1 (exact match).

Model Output

  • The GPT-4o model's output, which is an AIMessage object, contains the similarity score.
  • The script extracts the score from the response and prints it as the final result.
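
Putting the pieces above together, a minimal sketch of the prompt-and-invoke flow might look like this. The prompt wording and input file names are assumptions; see the llm directory for the actual code.

```python
# Minimal LangChain sketch of the GPT-4o similarity prediction flow.
from pathlib import Path

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)     # requires OPENAI_API_KEY in the environment

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a code similarity evaluator. Compare the candidate's answer with "
     "the AI-generated answer and return only a floating point number between "
     "0 (completely different) and 1 (exact match)."),
    ("human",
     "Question:\n{question}\n\nCandidate answer:\n{candidate_answer}\n\n"
     "AI-generated answer:\n{ai_answer}"),
])

chain = prompt | llm                                # a RunnableSequence

response = chain.invoke({
    "question": Path("llm/question.txt").read_text(),            # file names are assumptions
    "candidate_answer": Path("llm/candidate_answer.txt").read_text(),
    "ai_answer": Path("llm/ai_answer.txt").read_text(),
})
print(float(response.content))                      # similarity score from the AIMessage
```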

The detailed source code can be found in the llm directory of this repository.

Build from Source

  1. Clone the repo
git clone https://github.com/zzarif/AI-Detector.git
cd AI-Detector/
  2. Initialize and activate virtual environment
virtualenv venv
source venv/Scripts/activate
  3. Install dependencies
pip install -r requirements.txt

Note: In VS Code, select the virtual environment interpreter via Ctrl+Shift+P → Python: Select Interpreter

  4. Preprocess the Data

Run all the cells in preprocessing.ipynb Jupyter Notebook or run the following script:

python scripts/preprocessing.py
  5. Train the model

Run all the cells in train.ipynb Jupyter Notebook or run the following script (with specified model and hyperparameters):

python scripts/train.py --model all-MiniLM-L6-v2 --epochs 5 --batch_size 16
  6. Evaluate the model

Run all the cells in evaluation.ipynb Jupyter Notebook or run the following script (with the specified fine-tuned model):

python scripts/evaluation.py --ft_model all-MiniLM-L6-v2
  7. Perform model inference

Run all the cells in inference.ipynb Jupyter Notebook or run the following script (with the specified fine-tuned model):

python scripts/inference.py --ft_model all-MiniLM-L6-v2
  8. Export model for deployment

Run all the cells in export.ipynb Jupyter Notebook or run the following script (with the specified fine-tuned model):

python scripts/export.py --ft_model all-MiniLM-L6-v2
  9. Predict similarity with LLM

Create a .env file in the llm directory and insert the following line:

OPENAI_API_KEY=<your-openai-api-key>

Note: Replace <your-openai-api-key> with your own API key from OpenAI API Keys page.

Then run the following script:

python llm/main.py

Note: Modify the contents of the .txt files to test the model's capability on your own data.

Conclusion and Future Work

In this project, I dealt with a similarity-prediction regression task: developing a Machine Learning model that detects potential AI use by comparing candidate coding answers to responses generated by AI models (GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo). The model predicts an AI-detected score as a floating-point number from 0 (no AI detected) to 1 (heavily AI-detected).

In this project's context, I presented two different solutions to the task at hand. The first was fine-tuning multiple Sentence Transformers on the provided dataset. The model evaluation showed that all of the models performed well on the validation dataset. Since the fine-tuned all-distilroberta-v1 had the best overall performance, it was exported and quantized for deployment. The final quantized model was deployed to HuggingFace and integrated with a custom Flask Web App.

In the second solution, I used the NLP capabilities of OpenAI's GPT-4o to understand the context of the coding question together with both the candidate's answer and the AI-generated answer. It performs a semantic comparison to predict how closely the answers match based on the meaning and structure of the code.

Due to time constraints, I couldn't experiment further. For instance, I deployed the fine-tuned all-distilroberta-v1 because it had the best overall performance, but models like all-mpnet-base-v2 and multi-qa-mpnet-base-dot-v1 were strong candidates for deployment as well; I will export, quantize, and deploy them as future work. I could also employ more advanced techniques to solve the task at hand. For instance, I could develop a SimilarityPredictor on top of the SBERT model that learns a mapping from a candidate coding answer to its similarity prediction. This could potentially allow the SimilarityPredictor to predict the similarity score directly from the candidate's coding answer, without needing an equivalent AI-generated answer as a reference. I will add this to my future work as well.

Finally, I used a commercial LLM (OpenAI's GPT-4o) for my second solution. Due to time constraints, I couldn't further experiment with open-source options like Llama 3, Llama 2, Mistral, etc.

Miscellaneous

The utils directory contains some helper scripts. For example, the file_extension_resolver.py script parses the original dataset and builds the programming-language-to-extension mapping dictionary used in preprocessing.py.

Contact

LinkedIn Mail

Thank you so much for your interest. Would love your valuable feedback!