<p align="center">
  <a href="https://github.com/princeton-nlp/Llamao">
    <img src="assets/figures/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
  </a>
</p>

<div align="center">

 | [日本語](docs/README_JP.md) | [English](https://github.com/princeton-nlp/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |

</div>


---
<p align="center">
Code and data for our ICLR 2024 paper <a href="http://swe-bench.github.io/paper.pdf">SWE-bench: Can Language Models Resolve Real-World GitHub Issues?</a>
  <br/>
  <br/>
  <a href="https://www.python.org/">
    <img alt="Build" src="https://img.shields.io/badge/Python-3.8+-1f425f.svg?color=purple">
  </a>
  <a href="https://copyright.princeton.edu/policy">
    <img alt="License" src="https://img.shields.io/badge/License-MIT-blue">
  </a>
  <a href="https://badge.fury.io/py/swebench">
    <img src="https://badge.fury.io/py/swebench.svg">
  </a>
</p>

Please refer to our [website](http://swe-bench.github.io) for the public leaderboard and the [change log](https://github.com/princeton-nlp/SWE-bench/blob/main/CHANGELOG.md) for information on the latest updates to the SWE-bench benchmark.

## 📰 News
* **[Aug. 13, 2024]**: Introducing *SWE-bench Verified*! Part 2 of our collaboration with [OpenAI Preparedness](https://openai.com/preparedness/). A subset of 500 problems that real software engineers have confirmed are solvable. Check out more in the [report](https://openai.com/index/introducing-swe-bench-verified/)!
* **[Jun. 27, 2024]**: We have an exciting update for SWE-bench - with support from [OpenAI's Preparedness](https://openai.com/preparedness/) team: We're moving to a fully containerized evaluation harness using Docker for more reproducible evaluations! Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).
* **[Apr. 15, 2024]**: SWE-bench has gone through major improvements to resolve issues with the evaluation harness. Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240415_eval_bug/README.md).
* **[Apr. 2, 2024]**: We have released [SWE-agent](https://github.com/princeton-nlp/SWE-agent), which sets the state-of-the-art on the full SWE-bench test set! ([Tweet 🔗](https://twitter.com/jyangballin/status/1775114444370051582))
* **[Jan. 16, 2024]**: SWE-bench has been accepted to ICLR 2024 as an oral presentation! ([OpenReview 🔗](https://openreview.net/forum?id=VTF8yNQM66))
## 👋 Overview
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub.
Given a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.

<img src="assets/figures/teaser.png">

To access SWE-bench, copy and run the following code:
```python
from datasets import load_dataset
swebench = load_dataset('princeton-nlp/SWE-bench', split='test')
```
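Each task instance pairs a GitHub issue with the repository state it was filed against. As a quick sanity check, you can print a few fields of a single instance; the column names below reflect the dataset as published on Hugging Face and may differ slightly across dataset versions.
```python
# Inspect one task instance. Field names (instance_id, repo, problem_statement,
# patch) reflect the published dataset and may vary across dataset versions.
from datasets import load_dataset

swebench = load_dataset('princeton-nlp/SWE-bench', split='test')
example = swebench[0]

print(example['instance_id'])              # identifier, e.g. "<org>__<repo>-<number>"
print(example['repo'])                     # source GitHub repository
print(example['problem_statement'][:300])  # the issue text the model must resolve
print(example['patch'][:300])              # the reference ("gold") patch
```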

## 🚀 Set Up
SWE-bench uses Docker for reproducible evaluations.
Follow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine.
If you're setting up on Linux, we recommend completing the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) as well.

Finally, build SWE-bench from source:
```bash
git clone git@github.com:princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
```

Test your installation by running:
```bash
python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids sympy__sympy-20590 \
    --run_id validate-gold
```

## 💽 Usage
> [!WARNING]
> Running fast evaluations on SWE-bench can be resource intensive.
> We recommend running the evaluation harness on an `x86_64` machine with at least 120GB of free storage, 16GB of RAM, and 8 CPU cores.
> You may need to experiment with the `--max_workers` argument to find the optimal number of workers for your machine, but we recommend using fewer than `min(0.75 * os.cpu_count(), 24)`.
>
> If running with Docker Desktop, make sure to increase your virtual disk space so that ~120GB are free, and set `--max_workers` consistently with the guidance above for the CPUs available to Docker.
>
> Support for `arm64` machines is experimental.

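If it helps, the small sketch below turns that guideline into a concrete number for your machine. It is only a heuristic starting point; the best setting also depends on available RAM and disk throughput.
```python
# Heuristic worker count from the guideline above:
# fewer than min(0.75 * os.cpu_count(), 24).
import os

cpus = os.cpu_count() or 1
suggested = max(1, min(int(0.75 * cpus), 24))
print(f"Suggested upper bound for --max_workers: {suggested}")
```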
Evaluate model predictions on SWE-bench Lite using the evaluation harness with the following command:
```bash
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path <path_to_predictions> \
    --max_workers <num_workers> \
    --run_id <run_id>
    # use --predictions_path 'gold' to verify the gold patches
    # use --run_id to name the evaluation run
```
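The file passed to `--predictions_path` is a list of per-instance predictions. Below is a minimal sketch of writing one; the key names (`instance_id`, `model_name_or_path`, `model_patch`) match what the inference scripts emit, but double-check them against the version of the harness you are running.
```python
# Minimal sketch of a predictions file for --predictions_path.
# Key names follow the inference scripts' output; verify against your harness version.
import json

predictions = [
    {
        "instance_id": "sympy__sympy-20590",
        "model_name_or_path": "my-model",         # hypothetical model name
        "model_patch": "diff --git a/... b/...",  # unified diff produced by the model
    },
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```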

The evaluation command will generate Docker build logs (`logs/build_images`) and evaluation logs (`logs/run_evaluation`) in the current directory.

The final evaluation results will be stored in the `evaluation_results` directory.
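
Since the results are plain JSON, they are easy to skim programmatically. A minimal sketch, assuming the reports land as JSON files under `evaluation_results/`; the exact filenames and keys can differ between harness versions:
```python
# Minimal sketch for skimming finished runs. Assumes JSON reports under
# evaluation_results/; exact filenames and keys vary by harness version.
import json
from pathlib import Path

for report_path in sorted(Path("evaluation_results").glob("*.json")):
    with open(report_path) as f:
        report = json.load(f)
    print(report_path.name, sorted(report.keys()))
```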

To see the full list of arguments for the evaluation harness, run:
```bash
python -m swebench.harness.run_evaluation --help
```

Additionally, the SWE-bench repo can help you:
* Train your own models on our pre-processed datasets
* Run [inference](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md) on existing models (either models you have on-disk like LLaMA, or models you have access to through an API like GPT-4). The inference step takes a repository and an issue and has the model try to generate a fix for it (see the illustrative sketch after this list).
* Run SWE-bench's [data collection procedure](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/) on your own repositories to make new SWE-bench tasks.

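To make the inference step concrete, here is an illustrative sketch (not the repo's actual inference script; see the inference README linked above for that). It loads the "Oracle" retrieval split, builds a prompt from each instance, and writes predictions in the format the evaluation harness expects. `generate_patch` is a placeholder for whatever model or API you call, and the dataset field names are assumptions to verify against your dataset version.
```python
# Illustrative sketch of the inference step, not the repo's inference script.
# generate_patch is a placeholder for your own model or API call; the dataset
# field names (text, instance_id) are assumptions to verify for your version.
import json
from datasets import load_dataset

def generate_patch(prompt: str) -> str:
    raise NotImplementedError("call your model or API here and return a unified diff")

dataset = load_dataset("princeton-nlp/SWE-bench_oracle", split="test")

predictions = []
for instance in dataset.select(range(5)):      # small demo slice
    prompt = instance["text"]                  # issue text plus retrieved code context
    predictions.append({
        "instance_id": instance["instance_id"],
        "model_name_or_path": "my-model",      # hypothetical model name
        "model_patch": generate_patch(prompt),
    })

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```
From there, the predictions file can be scored with the `run_evaluation` command shown earlier.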
## ⬇️ Downloads
| Datasets | Models |
| - | - |
| [🤗 SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) | [🦙 SWE-Llama 13b](https://huggingface.co/princeton-nlp/SWE-Llama-13b) |
| [🤗 "Oracle" Retrieval](https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle) | [🦙 SWE-Llama 13b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-13b-peft) |
| [🤗 BM25 Retrieval 13K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_13K) | [🦙 SWE-Llama 7b](https://huggingface.co/princeton-nlp/SWE-Llama-7b) |
| [🤗 BM25 Retrieval 27K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_27K) | [🦙 SWE-Llama 7b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-7b-peft) |
| [🤗 BM25 Retrieval 40K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_40K) | |
| [🤗 BM25 Retrieval 50K (Llama tokens)](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_50k_llama) | |

## 🍎 Tutorials
We've also written the following blog posts on how to use different parts of SWE-bench.
If you'd like to see a post about a particular topic, please let us know via an issue.
* [Nov 1, 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/collection.md))
* [Nov 6, 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md))

## 💫 Contributions
We would love to hear from the broader NLP, Machine Learning, and Software Engineering research communities, and we welcome any contributions, pull requests, or issues!
To contribute, please file a new pull request or issue and fill in the corresponding template. We'll be sure to follow up shortly!

Contact persons: [Carlos E. Jimenez](http://www.carlosejimenez.com/) and [John Yang](https://john-b-yang.github.io/) (Email: [email protected], [email protected]).

## ✍️ Citation
If you find our work helpful, please use the following citation.
```
@inproceedings{
    jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}
```

## 🪪 License
MIT. Check `LICENSE.md`.