SWE‐bench
Christopher David edited this page Mar 17, 2024 · 19 revisions
SWE-bench is a benchmark suite for real-world software engineering.
- Paper: "Can Language Models Resolve Real-World GitHub Issues?"
Cognition Labs / Devin just claimed a 13.86% success rate.
- It's not "official" until it's on the swebench.com leaderboard, but we'll assume this is SOTA and the number to beat
- Blog: Devin SWE-bench summary
- GitHub: Devin SWE-bench results & evaluation harness
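For reference, a success rate here is just resolved issues divided by issues evaluated. A minimal sketch of that arithmetic follows. The counts (79 resolved out of 570 evaluated) are an assumption taken from Cognition's own write-up rather than from this page, so double-check them against the blog post before quoting them.

```python
def success_rate(resolved: int, evaluated: int) -> float:
    """Return the percentage of evaluated issues that were resolved."""
    if evaluated <= 0:
        raise ValueError("evaluated must be a positive count")
    return 100.0 * resolved / evaluated


# Assumed counts from Cognition's report (not stated on this page).
devin_rate = success_rate(resolved=79, evaluated=570)
print(f"{devin_rate:.2f}%")  # prints "13.86%"
```

Beating the number means a higher rate on a comparable set of issues, so the denominator matters as much as the numerator.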
This is what we want to beat. Plan:
- Analyze Devin's results repo
- What are the patterns of success/failure?
- What can we learn of their approach?
- Agentic loop with GPT-4, RAG, and ___?
- What architecture/algorithms are needed for success - and how best to build them?
- Set up benchmark
- Run basic agent to get an initial benchmark
- Iterate until solving an issue Devin didn't
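The analysis steps above could start from something like the sketch below. It assumes we can reduce both Devin's results and our own to plain lists of SWE-bench instance IDs (the `owner__repo-number` form), and all function names here are hypothetical helpers, not part of any harness. The IDs in the usage example are made-up placeholders.

```python
from collections import defaultdict


def repo_of(instance_id: str) -> str:
    """Map an instance ID like 'owner__repo-123' to 'owner/repo'."""
    slug, _, _number = instance_id.rpartition("-")
    return slug.replace("__", "/", 1)


def resolve_rate_by_repo(resolved, attempted):
    """Return {repo: (resolved_count, attempted_count, rate)}."""
    resolved = set(resolved)
    counts = defaultdict(lambda: [0, 0])
    for instance_id in attempted:
        entry = counts[repo_of(instance_id)]
        entry[1] += 1
        if instance_id in resolved:
            entry[0] += 1
    return {repo: (r, n, r / n) for repo, (r, n) in counts.items()}


def newly_solved(ours, theirs):
    """Instance IDs we resolved that the other system did not."""
    return sorted(set(ours) - set(theirs))


# Placeholder IDs for illustration only.
attempted = ["django__django-1", "django__django-2", "sympy__sympy-3"]
devin = ["django__django-1"]
ours = ["django__django-1", "django__django-2"]
print(resolve_rate_by_repo(devin, attempted))
print(newly_solved(ours, devin))
```

A per-repo breakdown is only a first cut at "patterns of success/failure"; the same grouping could be keyed on patch size or issue type once that data is available.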
Example baseline run using SWE-bench's run_api.py inference script:

python run_api.py \
    --dataset_name_or_path princeton-nlp/SWE-bench \
    --model_name_or_path gpt-4-0613 \
    --output_dir results \
    --model_args "temperature=0.7,top_p=0.95" \
    --max_cost 200
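Once a run finishes, a quick sanity check on the predictions is useful before paying for evaluation. The sketch below assumes the output is a JSONL file in which each record carries `instance_id` and `model_patch` keys; that matches the SWE-bench prediction format as far as we know, but verify against the actual file. `summarize_predictions` is a hypothetical helper, and the path in the usage comment is a placeholder.

```python
import json


def summarize_predictions(path):
    """Count predictions and list instances whose patch is empty."""
    total = 0
    empty_ids = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            total += 1
            # Assumed key names: "model_patch" and "instance_id".
            patch = record.get("model_patch") or ""
            if not patch.strip():
                empty_ids.append(record.get("instance_id"))
    return {
        "total": total,
        "with_patch": total - len(empty_ids),
        "empty_ids": empty_ids,
    }


# Usage (placeholder path; point it at the file the run actually wrote):
# print(summarize_predictions("results/predictions.jsonl"))
```

An empty patch can never resolve an issue, so a high empty count points at prompt or parsing problems rather than model capability.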