Commit ec0f86d

Merge branch 'main' into better-layering-2
2 parents f7007eb + a42c3c8 · commit ec0f86d

File tree

3 files changed: +164 / -143 lines


.github/workflows/ghcr_retention.yaml

Lines changed: 20 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 1 addition & 143 deletions
@@ -1,143 +1 @@

Removed (lines 1-143):

<p align="center">
  <a href="https://github.com/princeton-nlp/Llamao">
    <img src="assets/figures/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
  </a>
</p>

<div align="center">

| [日本語](docs/README_JP.md) | [English](https://github.com/princeton-nlp/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |

</div>

---
<p align="center">
Code and data for our ICLR 2024 paper <a href="http://swe-bench.github.io/paper.pdf">SWE-bench: Can Language Models Resolve Real-World GitHub Issues?</a>
  </br>
  </br>
  <a href="https://www.python.org/">
    <img alt="Build" src="https://img.shields.io/badge/Python-3.8+-1f425f.svg?color=purple">
  </a>
  <a href="https://copyright.princeton.edu/policy">
    <img alt="License" src="https://img.shields.io/badge/License-MIT-blue">
  </a>
  <a href="https://badge.fury.io/py/swebench">
    <img src="https://badge.fury.io/py/swebench.svg">
  </a>
</p>

Please refer to our [website](http://swe-bench.github.io) for the public leaderboard and the [change log](https://github.com/princeton-nlp/SWE-bench/blob/main/CHANGELOG.md) for information on the latest updates to the SWE-bench benchmark.

## 📰 News
* **[Aug. 13, 2024]**: Introducing *SWE-bench Verified*! Part 2 of our collaboration with [OpenAI Preparedness](https://openai.com/preparedness/). A subset of 500 problems that real software engineers have confirmed are solvable. Check out more in the [report](https://openai.com/index/introducing-swe-bench-verified/)!
* **[Jun. 27, 2024]**: We have an exciting update for SWE-bench - with support from [OpenAI's Preparedness](https://openai.com/preparedness/) team: we're moving to a fully containerized evaluation harness using Docker for more reproducible evaluations! Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).
* **[Apr. 15, 2024]**: SWE-bench has gone through major improvements to resolve issues with the evaluation harness. Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240415_eval_bug/README.md).
* **[Apr. 2, 2024]**: We have released [SWE-agent](https://github.com/princeton-nlp/SWE-agent), which sets the state of the art on the full SWE-bench test set! ([Tweet 🔗](https://twitter.com/jyangballin/status/1775114444370051582))
* **[Jan. 16, 2024]**: SWE-bench has been accepted to ICLR 2024 as an oral presentation! ([OpenReview 🔗](https://openreview.net/forum?id=VTF8yNQM66))

## 👋 Overview
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub.
Given a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.

<img src="assets/figures/teaser.png">

To access SWE-bench, copy and run the following code:
```python
from datasets import load_dataset
swebench = load_dataset('princeton-nlp/SWE-bench', split='test')
```

## 🚀 Set Up
SWE-bench uses Docker for reproducible evaluations.
Follow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine.
If you're setting up on Linux, we recommend working through the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) as well.

Finally, to build SWE-bench from source, follow these steps:
```bash
git clone git@github.com:princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
```

Test your installation by running:
```bash
python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids sympy__sympy-20590 \
    --run_id validate-gold
```

## 💽 Usage
> [!WARNING]
> Running fast evaluations on SWE-bench can be resource intensive.
> We recommend running the evaluation harness on an `x86_64` machine with at least 120GB of free storage, 16GB of RAM, and 8 CPU cores.
> You may need to experiment with the `--max_workers` argument to find the optimal number of workers for your machine, but we recommend using fewer than `min(0.75 * os.cpu_count(), 24)`.
>
> If running with Docker Desktop, make sure to increase your virtual disk space to have ~120 GB free, and set `--max_workers` consistent with the guidance above for the CPUs available to Docker.
>
> Support for `arm64` machines is experimental.

Evaluate model predictions on SWE-bench Lite using the evaluation harness with the following command:
```bash
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path <path_to_predictions> \
    --max_workers <num_workers> \
    --run_id <run_id>
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the evaluation run
```

This command will generate Docker build logs (`logs/build_images`) and evaluation logs (`logs/run_evaluation`) in the current directory.

The final evaluation results will be stored in the `evaluation_results` directory.

To see the full list of arguments for the evaluation harness, run:
```bash
python -m swebench.harness.run_evaluation --help
```

Additionally, the SWE-bench repo can help you:
* Train your own models on our pre-processed datasets
* Run [inference](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md) on existing models (either models you have on disk like LLaMA, or models you access through an API like GPT-4). In the inference step, the model is given a repo and an issue and tries to generate a fix for it.
* Run SWE-bench's [data collection procedure](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/) on your own repositories to create new SWE-bench tasks.

## ⬇️ Downloads
| Datasets | Models |
| - | - |
| [🤗 SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) | [🦙 SWE-Llama 13b](https://huggingface.co/princeton-nlp/SWE-Llama-13b) |
| [🤗 "Oracle" Retrieval](https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle) | [🦙 SWE-Llama 13b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-13b-peft) |
| [🤗 BM25 Retrieval 13K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_13K) | [🦙 SWE-Llama 7b](https://huggingface.co/princeton-nlp/SWE-Llama-7b) |
| [🤗 BM25 Retrieval 27K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_27K) | [🦙 SWE-Llama 7b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-7b-peft) |
| [🤗 BM25 Retrieval 40K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_40K) | |
| [🤗 BM25 Retrieval 50K (Llama tokens)](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_50k_llama) | |

## 🍎 Tutorials
We've also written the following blog posts on how to use different parts of SWE-bench.
If you'd like to see a post about a particular topic, please let us know via an issue.
* [Nov 1, 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/collection.md))
* [Nov 6, 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md))

## 💫 Contributions
We would love to hear from the broader NLP, Machine Learning, and Software Engineering research communities, and we welcome any contributions, pull requests, or issues!
To do so, please file a new pull request or issue and fill in the corresponding template. We'll be sure to follow up shortly!

Contact persons: [Carlos E. Jimenez](http://www.carlosejimenez.com/) and [John Yang](https://john-b-yang.github.io/) (Email: [email protected], [email protected]).

## ✍️ Citation
If you find our work helpful, please use the following citation.
```
@inproceedings{
    jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}
```

## 🪪 License
MIT. Check `LICENSE.md`.

Added (line 1):

Docker image registry for SWE-bench, created by [Epoch AI](https://epochai.org).

README_upstream.md

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@

Added (lines 1-143):

<p align="center">
  <a href="https://github.com/princeton-nlp/Llamao">
    <img src="assets/figures/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
  </a>
</p>

<div align="center">

| [日本語](docs/README_JP.md) | [English](https://github.com/princeton-nlp/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |

</div>

---
<p align="center">
Code and data for our ICLR 2024 paper <a href="http://swe-bench.github.io/paper.pdf">SWE-bench: Can Language Models Resolve Real-World GitHub Issues?</a>
  </br>
  </br>
  <a href="https://www.python.org/">
    <img alt="Build" src="https://img.shields.io/badge/Python-3.8+-1f425f.svg?color=purple">
  </a>
  <a href="https://copyright.princeton.edu/policy">
    <img alt="License" src="https://img.shields.io/badge/License-MIT-blue">
  </a>
  <a href="https://badge.fury.io/py/swebench">
    <img src="https://badge.fury.io/py/swebench.svg">
  </a>
</p>

Please refer to our [website](http://swe-bench.github.io) for the public leaderboard and the [change log](https://github.com/princeton-nlp/SWE-bench/blob/main/CHANGELOG.md) for information on the latest updates to the SWE-bench benchmark.

## 📰 News
* **[Aug. 13, 2024]**: Introducing *SWE-bench Verified*! Part 2 of our collaboration with [OpenAI Preparedness](https://openai.com/preparedness/). A subset of 500 problems that real software engineers have confirmed are solvable. Check out more in the [report](https://openai.com/index/introducing-swe-bench-verified/)!
* **[Jun. 27, 2024]**: We have an exciting update for SWE-bench - with support from [OpenAI's Preparedness](https://openai.com/preparedness/) team: we're moving to a fully containerized evaluation harness using Docker for more reproducible evaluations! Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).
* **[Apr. 15, 2024]**: SWE-bench has gone through major improvements to resolve issues with the evaluation harness. Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240415_eval_bug/README.md).
* **[Apr. 2, 2024]**: We have released [SWE-agent](https://github.com/princeton-nlp/SWE-agent), which sets the state of the art on the full SWE-bench test set! ([Tweet 🔗](https://twitter.com/jyangballin/status/1775114444370051582))
* **[Jan. 16, 2024]**: SWE-bench has been accepted to ICLR 2024 as an oral presentation! ([OpenReview 🔗](https://openreview.net/forum?id=VTF8yNQM66))

## 👋 Overview
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub.
Given a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.

<img src="assets/figures/teaser.png">

To access SWE-bench, copy and run the following code:
```python
from datasets import load_dataset
swebench = load_dataset('princeton-nlp/SWE-bench', split='test')
```
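
As a quick sanity check, you can inspect what was loaded. This is a minimal sketch, assuming the standard 🤗 `datasets` API; the `instance_id` field name is assumed from the harness flags used later in this README.

```python
# Minimal sketch: peek at the loaded benchmark.
# Assumes the standard datasets API; `instance_id` is assumed from the
# harness usage below (e.g. --instance_ids sympy__sympy-20590).
from datasets import load_dataset

swebench = load_dataset('princeton-nlp/SWE-bench', split='test')
print(len(swebench))               # number of task instances
print(swebench.column_names)       # available fields
print(swebench[0]['instance_id'])  # identifier used by the evaluation harness
```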

## 🚀 Set Up
SWE-bench uses Docker for reproducible evaluations.
Follow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine.
If you're setting up on Linux, we recommend working through the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) as well.

Finally, to build SWE-bench from source, follow these steps:
```bash
git clone git@github.com:princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
```

Test your installation by running:
```bash
python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids sympy__sympy-20590 \
    --run_id validate-gold
```

## 💽 Usage
> [!WARNING]
> Running fast evaluations on SWE-bench can be resource intensive.
> We recommend running the evaluation harness on an `x86_64` machine with at least 120GB of free storage, 16GB of RAM, and 8 CPU cores.
> You may need to experiment with the `--max_workers` argument to find the optimal number of workers for your machine, but we recommend using fewer than `min(0.75 * os.cpu_count(), 24)`.
>
> If running with Docker Desktop, make sure to increase your virtual disk space to have ~120 GB free, and set `--max_workers` consistent with the guidance above for the CPUs available to Docker.
>
> Support for `arm64` machines is experimental.
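
To turn the rule of thumb above into a concrete starting value for `--max_workers`, a small helper like the following can be used (a sketch of the recommendation only, not part of the harness):

```python
# Sketch: compute a starting point for --max_workers from the guidance above,
# i.e. stay at or below min(0.75 * os.cpu_count(), 24). Not part of the swebench package.
import os

def suggested_max_workers() -> int:
    cpus = os.cpu_count() or 1          # os.cpu_count() can return None
    return max(1, int(min(0.75 * cpus, 24)))

print(suggested_max_workers())
```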

Evaluate model predictions on SWE-bench Lite using the evaluation harness with the following command:
```bash
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path <path_to_predictions> \
    --max_workers <num_workers> \
    --run_id <run_id>
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the evaluation run
```
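
The `--predictions_path` argument points at a file of model outputs. The sketch below shows one plausible way to write such a file; the key names (`instance_id`, `model_name_or_path`, `model_patch`) are assumptions based on the harness documentation, so check `run_evaluation --help` and the inference README for the exact schema.

```python
# Sketch: write a predictions file for --predictions_path.
# Key names are assumed from the harness docs; verify before relying on them.
import json

predictions = [
    {
        "instance_id": "sympy__sympy-20590",        # instance from the example above
        "model_name_or_path": "my-model",           # hypothetical model name
        "model_patch": "diff --git a/f.py b/f.py",  # unified diff produced by the model
    }
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```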

This command will generate Docker build logs (`logs/build_images`) and evaluation logs (`logs/run_evaluation`) in the current directory.

The final evaluation results will be stored in the `evaluation_results` directory.

To see the full list of arguments for the evaluation harness, run:
```bash
python -m swebench.harness.run_evaluation --help
```

Additionally, the SWE-bench repo can help you:
* Train your own models on our pre-processed datasets
* Run [inference](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md) on existing models (either models you have on disk like LLaMA, or models you access through an API like GPT-4). In the inference step, the model is given a repo and an issue and tries to generate a fix for it.
* Run SWE-bench's [data collection procedure](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/) on your own repositories to create new SWE-bench tasks.

## ⬇️ Downloads
| Datasets | Models |
| - | - |
| [🤗 SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) | [🦙 SWE-Llama 13b](https://huggingface.co/princeton-nlp/SWE-Llama-13b) |
| [🤗 "Oracle" Retrieval](https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle) | [🦙 SWE-Llama 13b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-13b-peft) |
| [🤗 BM25 Retrieval 13K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_13K) | [🦙 SWE-Llama 7b](https://huggingface.co/princeton-nlp/SWE-Llama-7b) |
| [🤗 BM25 Retrieval 27K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_27K) | [🦙 SWE-Llama 7b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-7b-peft) |
| [🤗 BM25 Retrieval 40K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_40K) | |
| [🤗 BM25 Retrieval 50K (Llama tokens)](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_50k_llama) | |
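
The retrieval datasets in the table load the same way as the main benchmark. A sketch, assuming the "Oracle" variant exposes the same `test` split as `princeton-nlp/SWE-bench`:

```python
# Sketch: load the "Oracle" retrieval variant listed above.
# Assumes it mirrors the main dataset's split names; adjust if it does not.
from datasets import load_dataset

oracle = load_dataset('princeton-nlp/SWE-bench_oracle', split='test')
print(oracle.column_names)
```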

## 🍎 Tutorials
We've also written the following blog posts on how to use different parts of SWE-bench.
If you'd like to see a post about a particular topic, please let us know via an issue.
* [Nov 1, 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/collection.md))
* [Nov 6, 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md))

## 💫 Contributions
We would love to hear from the broader NLP, Machine Learning, and Software Engineering research communities, and we welcome any contributions, pull requests, or issues!
To do so, please file a new pull request or issue and fill in the corresponding template. We'll be sure to follow up shortly!

Contact persons: [Carlos E. Jimenez](http://www.carlosejimenez.com/) and [John Yang](https://john-b-yang.github.io/) (Email: [email protected], [email protected]).

## ✍️ Citation
If you find our work helpful, please use the following citation.
```
@inproceedings{
    jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}
```

## 🪪 License
MIT. Check `LICENSE.md`.
