<p align="center">
  <a href="https://github.com/princeton-nlp/Llamao">
    <img src="assets/figures/swellama_banner.png" width="50%" alt="Kawi the SWE-Llama" />
  </a>
</p>

<div align="center">

 | [日本語](docs/README_JP.md) | [English](https://github.com/princeton-nlp/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |

</div>


---
<p align="center">
Code and data for our ICLR 2024 paper <a href="http://swe-bench.github.io/paper.pdf">SWE-bench: Can Language Models Resolve Real-World GitHub Issues?</a>
  <br/>
  <br/>
  <a href="https://www.python.org/">
    <img alt="Build" src="https://img.shields.io/badge/Python-3.8+-1f425f.svg?color=purple">
  </a>
  <a href="https://copyright.princeton.edu/policy">
    <img alt="License" src="https://img.shields.io/badge/License-MIT-blue">
  </a>
  <a href="https://badge.fury.io/py/swebench">
    <img src="https://badge.fury.io/py/swebench.svg">
  </a>
</p>

Please refer to our [website](http://swe-bench.github.io) for the public leaderboard and the [change log](https://github.com/princeton-nlp/SWE-bench/blob/main/CHANGELOG.md) for information on the latest updates to the SWE-bench benchmark.

## 📰 News
* **[Aug. 13, 2024]**: Introducing *SWE-bench Verified*! Part 2 of our collaboration with [OpenAI Preparedness](https://openai.com/preparedness/). A subset of 500 problems that real software engineers have confirmed are solvable. Check out more in the [report](https://openai.com/index/introducing-swe-bench-verified/)!
* **[Jun. 27, 2024]**: We have an exciting update for SWE-bench - with support from [OpenAI's Preparedness](https://openai.com/preparedness/) team: We're moving to a fully containerized evaluation harness using Docker for more reproducible evaluations! Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).
* **[Apr. 15, 2024]**: SWE-bench has gone through major improvements to resolve issues with the evaluation harness. Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240415_eval_bug/README.md).
* **[Apr. 2, 2024]**: We have released [SWE-agent](https://github.com/princeton-nlp/SWE-agent), which sets the state-of-the-art on the full SWE-bench test set! ([Tweet 🔗](https://twitter.com/jyangballin/status/1775114444370051582))
* **[Jan. 16, 2024]**: SWE-bench has been accepted to ICLR 2024 as an oral presentation! ([OpenReview 🔗](https://openreview.net/forum?id=VTF8yNQM66))
## 👋 Overview
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub.
Given a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.

<img src="assets/figures/teaser.png">

To access SWE-bench, copy and run the following code:
```python
from datasets import load_dataset
swebench = load_dataset('princeton-nlp/SWE-bench', split='test')
```
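Each task instance pairs a GitHub issue with the repository state it was filed against. As a quick sanity check, you can print a few fields of a single instance; the column names below reflect the dataset as published on Hugging Face and may differ slightly across dataset versions.
```python
# Inspect one task instance. Field names (instance_id, repo, problem_statement,
# patch) reflect the published dataset and may vary across dataset versions.
from datasets import load_dataset

swebench = load_dataset('princeton-nlp/SWE-bench', split='test')
example = swebench[0]

print(example['instance_id'])              # identifier, e.g. "<org>__<repo>-<number>"
print(example['repo'])                     # source GitHub repository
print(example['problem_statement'][:300])  # the issue text the model must resolve
print(example['patch'][:300])              # the reference ("gold") patch
```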

## 🚀 Set Up
SWE-bench uses Docker for reproducible evaluations.
Follow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine.
If you're setting up on Linux, we recommend completing the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) as well.

Finally, build SWE-bench from source:
```bash
git clone git@github.com:princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
```

Test your installation by running:
```bash
python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids sympy__sympy-20590 \
    --run_id validate-gold
```

## 💽 Usage
> [!WARNING]
> Running fast evaluations on SWE-bench can be resource intensive.
> We recommend running the evaluation harness on an `x86_64` machine with at least 120GB of free storage, 16GB of RAM, and 8 CPU cores.
> You may need to experiment with the `--max_workers` argument to find the optimal number of workers for your machine, but we recommend using fewer than `min(0.75 * os.cpu_count(), 24)`.
>
> If running with Docker Desktop, make sure to increase your virtual disk space so that ~120GB are free, and set `--max_workers` consistently with the guidance above for the CPUs available to Docker.
>
> Support for `arm64` machines is experimental.

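If it helps, the small sketch below turns that guideline into a concrete number for your machine. It is only a heuristic starting point; the best setting also depends on available RAM and disk throughput.
```python
# Heuristic worker count from the guideline above:
# fewer than min(0.75 * os.cpu_count(), 24).
import os

cpus = os.cpu_count() or 1
suggested = max(1, min(int(0.75 * cpus), 24))
print(f"Suggested upper bound for --max_workers: {suggested}")
```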
Evaluate model predictions on SWE-bench Lite using the evaluation harness with the following command:
```bash
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path <path_to_predictions> \
    --max_workers <num_workers> \
    --run_id <run_id>
    # use --predictions_path 'gold' to verify the gold patches
    # use --run_id to name the evaluation run
```
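The file passed to `--predictions_path` is a list of per-instance predictions. Below is a minimal sketch of writing one; the key names (`instance_id`, `model_name_or_path`, `model_patch`) match what the inference scripts emit, but double-check them against the version of the harness you are running.
```python
# Minimal sketch of a predictions file for --predictions_path.
# Key names follow the inference scripts' output; verify against your harness version.
import json

predictions = [
    {
        "instance_id": "sympy__sympy-20590",
        "model_name_or_path": "my-model",         # hypothetical model name
        "model_patch": "diff --git a/... b/...",  # unified diff produced by the model
    },
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```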

The evaluation command will generate Docker build logs (`logs/build_images`) and evaluation logs (`logs/run_evaluation`) in the current directory.

The final evaluation results will be stored in the `evaluation_results` directory.
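
Since the results are plain JSON, they are easy to skim programmatically. A minimal sketch, assuming the reports land as JSON files under `evaluation_results/`; the exact filenames and keys can differ between harness versions:
```python
# Minimal sketch for skimming finished runs. Assumes JSON reports under
# evaluation_results/; exact filenames and keys vary by harness version.
import json
from pathlib import Path

for report_path in sorted(Path("evaluation_results").glob("*.json")):
    with open(report_path) as f:
        report = json.load(f)
    print(report_path.name, sorted(report.keys()))
```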

To see the full list of arguments for the evaluation harness, run:
```bash
python -m swebench.harness.run_evaluation --help
```

Additionally, the SWE-bench repo can help you:
* Train your own models on our pre-processed datasets
* Run [inference](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md) on existing models (either models you have on-disk like LLaMA, or models you have access to through an API like GPT-4). The inference step takes a repository and an issue and has the model try to generate a fix for it (see the illustrative sketch after this list).
* Run SWE-bench's [data collection procedure](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/) on your own repositories to make new SWE-bench tasks.

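To make the inference step concrete, here is an illustrative sketch (not the repo's actual inference script; see the inference README linked above for that). It loads the "Oracle" retrieval split, builds a prompt from each instance, and writes predictions in the format the evaluation harness expects. `generate_patch` is a placeholder for whatever model or API you call, and the dataset field names are assumptions to verify against your dataset version.
```python
# Illustrative sketch of the inference step, not the repo's inference script.
# generate_patch is a placeholder for your own model or API call; the dataset
# field names (text, instance_id) are assumptions to verify for your version.
import json
from datasets import load_dataset

def generate_patch(prompt: str) -> str:
    raise NotImplementedError("call your model or API here and return a unified diff")

dataset = load_dataset("princeton-nlp/SWE-bench_oracle", split="test")

predictions = []
for instance in dataset.select(range(5)):      # small demo slice
    prompt = instance["text"]                  # issue text plus retrieved code context
    predictions.append({
        "instance_id": instance["instance_id"],
        "model_name_or_path": "my-model",      # hypothetical model name
        "model_patch": generate_patch(prompt),
    })

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```
From there, the predictions file can be scored with the `run_evaluation` command shown earlier.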
## ⬇️ Downloads
| Datasets | Models |
| - | - |
| [🤗 SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) | [🦙 SWE-Llama 13b](https://huggingface.co/princeton-nlp/SWE-Llama-13b) |
| [🤗 "Oracle" Retrieval](https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle) | [🦙 SWE-Llama 13b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-13b-peft) |
| [🤗 BM25 Retrieval 13K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_13K) | [🦙 SWE-Llama 7b](https://huggingface.co/princeton-nlp/SWE-Llama-7b) |
| [🤗 BM25 Retrieval 27K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_27K) | [🦙 SWE-Llama 7b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-7b-peft) |
| [🤗 BM25 Retrieval 40K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_40K) | |
| [🤗 BM25 Retrieval 50K (Llama tokens)](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_50k_llama) | |

## 🍎 Tutorials
We've also written the following blog posts on how to use different parts of SWE-bench.
If you'd like to see a post about a particular topic, please let us know via an issue.
* [Nov 1, 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/collection.md))
* [Nov 6, 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md))

## 💫 Contributions
We would love to hear from the broader NLP, Machine Learning, and Software Engineering research communities, and we welcome any contributions, pull requests, or issues!
To contribute, please file a new pull request or issue and fill in the corresponding template. We'll be sure to follow up shortly!

Contact persons: [Carlos E. Jimenez](http://www.carlosejimenez.com/) and [John Yang](https://john-b-yang.github.io/) (Email: [email protected], [email protected]).

## ✍️ Citation
If you find our work helpful, please use the following citation.
```
@inproceedings{
    jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}
```

## 🪪 License
MIT. Check `LICENSE.md`.