Has Anyone Successfully Run the SWE-Lancer Benchmark? #49

ShaneTian · 2025-02-28T06:53:55Z

Description:
I am trying to run the SWE-Lancer benchmark on my system, but I would like to confirm if others have successfully completed the process.

System Details:

OS: CentOS Linux 7 64bit / Linux 4.14.0_1-0-0-51
CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz 3.10/0.00GHz, 96 core
MEM: 377GB
Disk: 8 * 7.3TB
Docker: 27.0.0-rc.2

Steps Taken

✅ Step 1: Environment Setup (Success)

uv sync
source .venv/bin/activate
for proj in nanoeval alcatraz nanoeval_alcatraz; do
  uv pip install -e project/"$proj"
done

✅ Step 2: Docker Build (Success)

docker buildx build \
  -f Dockerfile_x86 \
  --platform linux/amd64 \
  --ssh default=$SSH_AUTH_SOCK \
  --network host \
  -t swelancer \
  .

✅ Step 3: Running Container (Success)

Using the ISSUE_ID=1 environment variable, as suggested in #44 (comment):

docker run -itd --name swelancer-runtime -p 5900:5900 -p 5901:5901 -e ISSUE_ID=1 swelancer

❓ Step 4: Running SWE-Lancer (Partial Failure)

uv run python run_swelancer.py --issue_ids 1 2 3 4 5 6 7 8 9 10 11

I used the gpt-4o-2024-11-20 model and ran the first 11 issues. Each issue takes roughly 20 minutes to an hour.
Seven issues (1, 2, 4, 5, 8, 10, 11) ran to completion, but every result was a failure. The other four (3, 6, 7, 9) did not run properly and raised errors.
I am not sure whether something is wrong with my setup.
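
One way to narrow this down is to separate the issues that ran from the ones that raised errors, then rerun the erroring ones one at a time so each traceback can be captured separately. The sketch below is only an illustration: the helper names are made up, and it assumes `--issue_ids` also accepts a single ID, which the thread does not confirm.

```python
import shlex

# Issue IDs taken from the run described above.
ATTEMPTED = list(range(1, 12))            # --issue_ids 1 ... 11
RAN_BUT_FAILED = [1, 2, 4, 5, 8, 10, 11]  # completed, but graded as failures
RAISED_ERRORS = [3, 6, 7, 9]              # did not run properly


def summarize(attempted, ran, errored):
    """Count outcomes and list any attempted issue that is in neither group."""
    unaccounted = sorted(set(attempted) - set(ran) - set(errored))
    return {
        "attempted": len(attempted),
        "ran": len(ran),
        "errored": len(errored),
        "unaccounted": unaccounted,
    }


def rerun_commands(issue_ids):
    """Build one command line per issue (assumes a single ID is accepted)."""
    return [
        shlex.join(["uv", "run", "python", "run_swelancer.py", "--issue_ids", str(i)])
        for i in issue_ids
    ]


if __name__ == "__main__":
    print(summarize(ATTEMPTED, RAN_BUT_FAILED, RAISED_ERRORS))
    for cmd in rerun_commands(RAISED_ERRORS):
        print(cmd)
```

Posting the traceback from each of the four erroring issues would make it easier for others to tell a setup problem from a problem in the benchmark itself.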

Thanks in advance! 🚀


Lucky-w0y · 2025-03-03T07:15:43Z

Hello, I am also running SWE-Lancer, but my user-tool does not run successfully. Did you see this error when the agent calls the user-tool?

bash: cannot set terminal process group (17647): Inappropriate ioctl for device
bash: no job control in this shell
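
For context, these two bash messages are what bash typically prints when it is started in interactive mode without a controlling terminal. On their own they are usually warnings rather than the cause of a failure. The sketch below is not part of SWE-Lancer: it filters them out of captured stderr so that any remaining error is easier to spot. The function name and the regex patterns are assumptions for illustration.

```python
import re

# Patterns for the two bash messages quoted above; the number in parentheses varies.
_JOB_CONTROL_NOISE = [
    re.compile(
        r"^bash: cannot set terminal process group \(-?\d+\): "
        r"Inappropriate ioctl for device$"
    ),
    re.compile(r"^bash: no job control in this shell$"),
]


def strip_job_control_noise(stderr_text: str) -> str:
    """Drop the known bash job-control warnings and keep every other line."""
    kept = [
        line
        for line in stderr_text.splitlines()
        if not any(p.match(line.strip()) for p in _JOB_CONTROL_NOISE)
    ]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = (
        "bash: cannot set terminal process group (17647): "
        "Inappropriate ioctl for device\n"
        "bash: no job control in this shell\n"
    )
    # An empty result suggests the real failure is reported somewhere else.
    print(repr(strip_job_control_noise(sample)))
```

If nothing is left after filtering, the cause of the user-tool failure is probably elsewhere in the logs.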

BoxiYu · 2025-03-03T12:54:16Z

Hi guys, I have successfully run the SWE-Lancer examples. If you are also on an x86 architecture, you can download my pre-built Docker image here: https://hub.docker.com/repository/docker/cccav/swelancer_x86/general.

Wish you good luck!

moresearch · 2025-03-04T14:13:52Z

@BoxiYu that's great. Could you give a rough idea of the cost of running the examples? And why x86?

BoxiYu · 2025-03-04T14:21:18Z

@moresearch Hi, I only ran it on the two examples, which cost about 2 dollars on average, I think; I did not record it precisely. The image at the link above was built on an x86 cloud server, and it has run smoothly on my x86 laptop. I could not build it successfully on my ARM device (possibly because of a network error).

moresearch · 2025-03-04T18:56:33Z

@BoxiYu do you think we could, as a community, collectively gather cost data per model and agent implementation? Would you be interested in participating in such an endeavour?
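
As a starting point for that kind of shared cost data, here is a minimal sketch of what a per-run record and a per-model average could look like. Everything in it is hypothetical: the `CostReport` fields, the agent label, and the numbers in the usage example are placeholders, not measurements from anyone in this thread.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CostReport:
    """One contributor-reported run (hypothetical schema)."""
    model: str       # e.g. "gpt-4o-2024-11-20"
    agent: str       # free-form label for the agent implementation
    issue_id: int
    cost_usd: float


def average_cost(reports):
    """Return the mean cost in USD for each (model, agent) pair."""
    grouped = defaultdict(list)
    for r in reports:
        grouped[(r.model, r.agent)].append(r.cost_usd)
    return {key: mean(costs) for key, costs in grouped.items()}


if __name__ == "__main__":
    # PLACEHOLDER values for illustration only, not real measurements.
    reports = [
        CostReport("gpt-4o-2024-11-20", "example-agent", 1, 1.0),
        CostReport("gpt-4o-2024-11-20", "example-agent", 2, 3.0),
    ]
    print(average_cost(reports))
```

Recording the issue ID alongside the cost matters because run time, and presumably cost, varies a lot between issues.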

petebachant mentioned this issue on Mar 4, 2025: Improve reproducibility and reduce manual setup steps with Calkit #54

