This project aims to independently benchmark the performance of various ChatGPT and Gemini models on Fault Localization (FL) and Automatic Program Repair (APR) debugging tasks. The investigation is divided into two main focus areas:
- Focus Area 1: Benchmarking individual models
  - Objective: To evaluate the performance of different ChatGPT and Gemini models in debugging tasks.
  - Process: Each model is tested independently to determine which performs best in terms of accuracy.
- Focus Area 2: Multi-agent debugging
  - Objective: To implement a multi-agent system in which LLMs (Large Language Models) work together in collaborative conversations to carry out the debugging process.
  - Process: After identifying the best-performing models, they are integrated into a multi-agent framework that allows the LLMs to collaborate in conversations aimed at improving debugging performance (a minimal sketch of such a conversation is shown after the goals below).
  - Outcome: Additional testing is conducted to determine whether collaboration between LLM agents leads to better debugging results than individual models.
The primary goals of this investigation are:
- To assess the viability of LLMs in debugging tasks.
- To explore whether collaborative conversations between multiple LLM agents can improve debugging performance.
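The multi-agent conversations are built with pyautogen (installed in the setup steps below). As a rough illustration only, the sketch that follows shows how two agents might converse about a repair task; the agent names, system messages, and buggy snippet are placeholders rather than the repository's actual configuration.

```python
# Illustrative sketch only: two pyautogen agents collaborating on a repair task.
# Assumes an OAI_CONFIG_LIST.json file as described in the setup section below.
import autogen

config_list = autogen.config_list_from_json("OAI_CONFIG_LIST.json")

# One agent proposes a fix, the other reviews it; the reviewer stops
# auto-replying after two turns so the conversation terminates.
fixer = autogen.AssistantAgent(
    name="fixer",
    system_message="You repair buggy Java methods and return the corrected code.",
    llm_config={"config_list": config_list},
)
reviewer = autogen.AssistantAgent(
    name="reviewer",
    system_message="You review proposed fixes and point out any remaining defects.",
    llm_config={"config_list": config_list},
    max_consecutive_auto_reply=2,
)

buggy_method = "public int add(int a, int b) { return a - b; }"  # placeholder bug
reviewer.initiate_chat(fixer, message=f"Please repair this method:\n{buggy_method}")
```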
The code in this repository is organized into three distinct branches to streamline the benchmarking and evaluation process:
- Gemini: Benchmarking different Gemini Models
- ChatGPT: Benchmarking different GPT Models
- Multi-Agent-Conversation: Evaluating the performance of the Multi-Agent System
- EvalGPTFix_Extracted & Gemini_Extracted: the structured data extracted from the datasets used to benchmark the models.
- LLM_Scripts: the Python scripts that implement the pipeline for prompting the LLMs and generating files to store their responses (a rough sketch of this pipeline is shown after this list).
- Responses: all responses generated by the LLMs, stored in sequential order, including CSV files and the code files produced during the debugging tasks.
- Tests: validation of the results using test cases.
- Analysis: test outcomes and the general analysis of each model's performance.
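As a rough illustration of the kind of prompt-and-record pipeline in LLM_Scripts, the sketch below queries a model and appends its response to a CSV file. The model name, prompt, and file name are placeholders and do not correspond to the repository's actual scripts.

```python
# Illustrative sketch only: prompt a model and append its response to a CSV file.
import csv
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def repair(buggy_code: str) -> str:
    """Ask the model for a fixed version of the given code snippet."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You fix bugs in Java code."},
            {"role": "user", "content": buggy_code},
        ],
    )
    return response.choices[0].message.content

# Store each (input, output) pair in sequential order, as in the Responses folder.
with open("responses.csv", "a", newline="") as f:
    writer = csv.writer(f)
    bug = "public int add(int a, int b) { return a - b; }"  # placeholder bug
    writer.writerow([bug, repair(bug)])
```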
To run the multi-agent system or any of the individual LLM scripts, please follow the steps below:
Ensure that you have the following installed:

- Python (version 3.7+): download and install Python from here.
- Java: download and install Java from here.
- JUnit: download the JUnit VS Code extension to run the Java tests from here.
- Python libraries: install the required packages by running the commands below.

```
pip install openai
pip install -q -U google-generativeai
pip install python-dotenv
pip install pyautogen
pip install pyautogen[gemini]
pip install pyautogen[gemini,retrievechat,lmm]
pip install pytest
pip install panel
```
Create a `.env` file in the root directory and include the following keys with your respective API credentials:

```
OPENAI_API_KEY=<Your OpenAI API Key Here>
GEMINI_API_KEY=<Your Gemini API Key Here>
```
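As a minimal sketch (assuming python-dotenv, installed above), the scripts can then load these keys from the environment; the variable names are examples only:

```python
# Illustrative sketch only: load the API keys defined in the .env file.
import os

import google.generativeai as genai
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads the .env file in the current working directory

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
```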
Create an `OAI_CONFIG_LIST.json` file in the root directory with the following structure, filling in your respective API credentials:
```json
[
    {
        "model": "gpt-4o-mini",
        "api_key": "<Your API Key Here>"
    },
    {
        "model": "gpt-4o",
        "api_key": "<Your API Key Here>"
    },
    {
        "model": "gemini-1.5-pro",
        "api_key": "<Your API Key Here>",
        "api_type": "google"
    },
    {
        "model": "gemini-1.5-flash",
        "api_key": "<Your API Key Here>",
        "api_type": "google"
    }
]
```
Make sure to replace `<Your API Key Here>` with your actual API key for each model.
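For reference, pyautogen can load and filter this file with `config_list_from_json`; the filter below is only an example and not necessarily how the repository's scripts select a model.

```python
# Illustrative sketch only: load OAI_CONFIG_LIST.json and keep one model's entry.
import autogen

config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST.json",
    filter_dict={"model": ["gemini-1.5-flash"]},  # example filter
)

assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config={"config_list": config_list},
)
```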
To run the multi-agent system using the front-end Panel interface, execute the following command:

```
panel serve Conversations_UI.py
```
Completed by Taahir Kolia and Muhammad Raees Dindar as part of the ELEN4012A Investigation.