Port mini-swe-agent #21
…p packages and download/install tests.
cmd.extend(["--cost-limit", str(cost_limit)])

# build user prompt
prompt, has_assistant_response = build_user_prompt(state.messages)
is has_assistant_response ever used / needed?
@celiawaggoner not in this case (neither is the while True structure) because of the limitation of the single-turn conversation (in the PR description). I left it mostly to keep the structure of the other agents and to then extend to the multi-turn version should you think that is part of the scope for this project. Otherwise I'll clean this function up a bit.
Gotcha, thanks for the context Ana! Let's leave this thread open and we can get input from this repo's maintainers on whether they prefer including the multi-turn version in this PR or as a follow-up.
Jay-Bailey
left a comment
The big thing is the error I'm getting around Python 3 - unsure why. I would call that just a TB2 issue and say maybe mini-swe-agent doesn't have to be compatible with it, but since TB2 specifically was added and presumably works for you, this should be addressed before we merge it.
@task
def terminal_bench_task(
When running this with uv run --no-sync inspect eval examples/terminal_bench_test --limit 1 (no-sync needed to prevent aisitools from trying to install)
I get:
/home/ec2-user/inspect_swe/src/inspect_swe/_util/agentwheel.py:140 in detect_python_version │
│ │
│ 137 │ # Detects Python version in sandbox. │
│ 138 │ result = await sandbox.exec(bash_command("python3 --version"), user=user) │
│ 139 │ if not result.success: │
│ ❱ 140 │ │ raise RuntimeError("Python 3 not found in sandbox (required for agent)") │
│ 141 │ │
│ 142 │ # Parse "Python 3.12.0" -> "312" │
│ 143 │ match = re.search(r"Python (\d+).(\d+)", result.stdout) │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Python 3 not found in sandbox (required for agent)
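The traceback above comes from the step that parses the output of `python3 --version`. Below is a minimal sketch of that parsing step alone, not the PR's actual implementation: the function name `parse_python_version` is hypothetical, and returning `None` instead of raising is an assumption that would let a caller fall back to another install route when the sandbox has no Python. Note that the sketch escapes the dot in the pattern so it matches a literal `.`, whereas the pattern shown in the traceback leaves it unescaped.

```python
import re
from typing import Optional


def parse_python_version(version_output: str) -> Optional[str]:
    """Turn output like "Python 3.12.0" into "312"; return None if unparseable.

    Hypothetical helper for illustration only.
    """
    # The dot is escaped so it matches a literal "." rather than any character.
    match = re.search(r"Python (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return f"{match.group(1)}{match.group(2)}"


print(parse_python_version("Python 3.12.0"))                     # 312
print(parse_python_version("bash: python3: command not found"))  # None
```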
@Jay-Bailey The code assumed Python was installed in the sandbox, which didn't hold for the particular task that --limit 1 pulled. I'm now working on changes to use uv in the sandbox instead, to avoid this issue for sandboxes with no Python (should have it done by tomorrow).
In the meantime, running your command with --T eval_names='["break-filter-js-from-html"]' should work (this particular task has Python installed).
I can confirm this command now runs, even without the --T.
attempt_count = 0

while True:  # Kept for consistency with other agents but currently only single attempt supported
    # TODO: build command with task. This only works for single-turn tasks currently. Need to update to support multi-turn (perhaps via file output option)
Is this still a todo?
@Jay-Bailey yes - see Limitations in the PR description. I'm still waiting to know whether this should be part of the scope for this PR or done separately.
if model is not None
else get_model().name,
"OPENAI_API_BASE": f"http://localhost:{bridge.port}/v1",
"OPENAI_API_KEY": "sk-none",
Is this intended as the key?
It is just a placeholder so the agent detects a key; the real one will be passed by Inspect.
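As a rough illustration of the point in this thread, the sketch below builds the two environment entries quoted in the diff: the base URL points at a local bridge port and the key is a dummy value. The function name `bridge_env` and the constant `PLACEHOLDER_KEY` are hypothetical, and the sketch assumes the agent only checks that some key is set while the bridge handles real credentials.

```python
from typing import Dict

# Dummy value: assumed to exist only so the agent sees a non-empty key.
PLACEHOLDER_KEY = "sk-none"


def bridge_env(port: int) -> Dict[str, str]:
    """Environment entries that route OpenAI-style requests to a local bridge.

    Hypothetical helper; the keys and values mirror the diff lines above.
    """
    return {
        "OPENAI_API_BASE": f"http://localhost:{port}/v1",
        "OPENAI_API_KEY": PLACEHOLDER_KEY,
    }


print(bridge_env(8080))
```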
FYI: some codex-cli tests seem to be failing.
Jay-Bailey
left a comment
This now looks good and the command runs. I am not sure what my approval does to your ability to merge, but I trust that Meridian has a setup they're happy with around it. I am happy with the PR. Thanks for your patience!
@jjallaire this is ready for review - can someone from the Meridian team take a look? There are also a couple of open questions about scope mentioned in the PR description.
Hi @celiawaggoner, fantastic to see this ready to go! I will review this either today or early next week. A couple of things to put on your radar pre-review:
Thanks @jjallaire! @anaoaktree can you take a look at JJ's comments / questions and work on the updates?
Summary
Add mini-swe-agent (https://github.com/SWE-agent/mini-swe-agent) as a third agent integration for inspect-swe (github.com/meridianlabs-ai/inspect_swe), alongside claude_code and codex_cli. Note that mini-swe-agent is Python-based, which required different considerations from the existing binary-executable agents.
Important: mini-swe will have a v2.0 with breaking changes https://mini-swe-agent.com/latest/advanced/v2_migration/. No migration guide at the time of writing.
Closes #18
[UPDATE] Results over one run for each task
Compared with terminal bench reported results using gpt-5-mini
Random five challenges
Results in line with TB
Four pass challenges
Results in line with TB except for fix-git.
Key changes
mini-swe-agent CLI
_mini_swe_agent/mini_swe_agent.py file utilising most of the same control structures as existing agents.
1.17.4 which is the latest v1 stable version, monitoring v2.0 release.
Package Installation Pattern
pip wheel download on host → tarball → uv offline install in sandbox
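The three stages above can be sketched as follows. This is an illustration under stated assumptions, not the code in `_util/agentwheel.py`: the function names are hypothetical, the commands are only built (not executed), and the exact pip and uv options the PR uses may differ. The flags shown (`pip download --dest/--only-binary/--python-version/--platform` and `uv pip install --offline --no-index --find-links`) are standard options of those tools.

```python
import tarfile
from pathlib import Path
from typing import List


def pip_download_cmd(package: str, version: str, dest: Path,
                     python_version: str, platform: str) -> List[str]:
    """Host side: command to fetch wheels for the sandbox's Python and platform."""
    return [
        "pip", "download", f"{package}=={version}",
        "--dest", str(dest),
        "--only-binary=:all:",  # required by pip when targeting another interpreter
        "--python-version", python_version,
        "--platform", platform,
    ]


def bundle_wheels(wheel_dir: Path, tarball: Path) -> List[str]:
    """Host side: pack every .whl in wheel_dir into one gzip tarball."""
    names = sorted(p.name for p in wheel_dir.glob("*.whl"))
    with tarfile.open(tarball, "w:gz") as tar:
        for name in names:
            tar.add(wheel_dir / name, arcname=name)
    return names


def uv_offline_install_cmd(package: str, version: str, wheel_dir: str) -> List[str]:
    """Sandbox side: install from the unpacked wheels without network access."""
    return [
        "uv", "pip", "install",
        "--offline", "--no-index",
        "--find-links", wheel_dir,
        f"{package}=={version}",
    ]


print(pip_download_cmd("mini-swe-agent", "1.17.4", Path("wheels"),
                       "3.12", "manylinux2014_x86_64"))
print(uv_offline_install_cmd("mini-swe-agent", "1.17.4", "/tmp/wheels"))
```

Keeping the download on the host means the sandbox never needs network access or a working pip; it only needs uv and the unpacked tarball.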
Note: Created new _util/agentwheel.py file with potential to expand into a more robust shared utility for other Python agents with improved data structures. Suggested improvement after a second Python agent gets ported.
Tests
test_download.py and test_ensure_wheels_install.py to test sandbox and package setup (no need for API keys)
test_system_explorer.py to support mini-swe-agent
examples/terminal_bench_test/ to test terminal bench tasks with the agents for validation. Challenges were picked based on their consistency on the official leaderboard and their running cost. Picked break-filter-js-from-html, which always failed, and constraints-scheduling, which always passed.
Limitations
No built-in session
Testing
claude_code and codex_cli and improve code reusability
Next steps & Open questions