Port mini-swe-agent #21

anaoaktree · 2026-01-16T16:30:50Z

Summary

Add mini-swe-agent (https://github.com/SWE-agent/mini-swe-agent) as a third agent integration for inspect-swe (github.com/meridianlabs-ai/inspect_swe), alongside claude_code and codex_cli. Note that mini-swe-agent is Python-based, which required different considerations from the existing binary-executable agents.

Important: mini-swe-agent will have a v2.0 with breaking changes (https://mini-swe-agent.com/latest/advanced/v2_migration/). No migration guide was available at the time of writing.

Closes #18

[UPDATE] Results over one run for each task

Compared with the reported Terminal-Bench (TB) results using gpt-5-mini.

Random five challenges

uv run --no-sync inspect eval examples/terminal_bench_test --limit 5 --model openai/gpt-5-mini

Results in line with TB

Four pass challenges

uv run --no-sync inspect eval examples/terminal_bench_test --model openai/gpt-5-mini  -T eval_names='["constraints-scheduling", "fix-git", "git-multibranch", "prove-plus-comm"]'

Results in line with TB except for fix-git.

Key changes

mini-swe-agent CLI

New _mini_swe_agent/mini_swe_agent.py file that reuses most of the control structures of the existing agents.
Currently pinned to 1.17.4, the latest stable v1 release; the v2.0 release is being monitored.
Consider adding version validation to warn if the user requests v2.x (once released).
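
The version validation suggested above might look roughly like the following sketch. The function name `warn_if_unsupported_version` and both constants are hypothetical and not part of this PR; the check only looks at the leading major-version number of the requested version string.

```python
import warnings

# Assumption: 1.17.4 is the pinned release mentioned in the description.
PINNED_VERSION = "1.17.4"
SUPPORTED_MAJOR = 1


def warn_if_unsupported_version(requested: str) -> bool:
    """Warn and return False when the requested major version is not v1."""
    major_text = requested.strip().lstrip("v").split(".", 1)[0]
    if not major_text.isdigit():
        raise ValueError(f"Unrecognised version string: {requested!r}")
    if int(major_text) != SUPPORTED_MAJOR:
        warnings.warn(
            f"mini-swe-agent {requested} is untested; this integration "
            f"is pinned to {PINNED_VERSION}.",
            stacklevel=2,
        )
        return False
    return True
```

Returning a boolean rather than raising keeps the decision (abort or continue) with the caller, which seems to fit a "warn" requirement better than a hard failure.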

# --yolo              run without confirmations
# --exit-immediately  exit when done (non-interactive)
# --cost-limit N      optional cost limit
# --task "PROMPT"     the task to execute
mini \
  --yolo \
  --exit-immediately \
  --cost-limit N \
  --task "PROMPT"
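
As a rough illustration of how this invocation can be assembled programmatically, here is a minimal sketch. The helper name `build_mini_command` is hypothetical; only the four flags listed above are used, and the optional cost limit is appended in the same way as the `cmd.extend(["--cost-limit", str(cost_limit)])` line in the diff.

```python
from typing import Optional


def build_mini_command(prompt: str, cost_limit: Optional[float] = None) -> list[str]:
    """Assemble the non-interactive `mini` invocation as an argument list."""
    cmd = ["mini", "--yolo", "--exit-immediately"]
    if cost_limit is not None:
        # The cost limit is optional, so it is only added when requested.
        cmd.extend(["--cost-limit", str(cost_limit)])
    cmd.extend(["--task", prompt])
    return cmd
```

Passing the command as a list rather than a single string avoids shell-quoting problems when the prompt contains quotes or newlines.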

Package Installation Pattern

pip wheel download on host → tarball → uv offline install in sandbox

Uses pip to download wheels on the host and uv to install them, which also covers sandboxes without Python.
Follows the same host-download/sandbox-install pattern as the binary agents, mirroring some of the existing utilities but for wheels.
Detects the platform and Python version; pip manages version resolution.
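
The three stages of this pattern can be sketched as below. This is an illustration under stated assumptions, not the code in _util/agentwheel.py: all three function names are hypothetical, the commands are only built (not executed), and a real sandbox install would additionally need a target environment for uv.

```python
import tarfile
from pathlib import Path


def pip_download_command(package: str, version: str, dest: Path,
                         platform: str, python_version: str) -> list[str]:
    """Host side: ask pip for binary wheels matching the sandbox target."""
    return [
        "pip", "download", f"{package}=={version}",
        "--dest", str(dest),
        "--only-binary=:all:",  # required by pip when --platform is given
        "--platform", platform,
        "--python-version", python_version,
    ]


def bundle_wheels(wheel_dir: Path, tarball: Path) -> Path:
    """Host side: pack every downloaded wheel into one gzipped tarball."""
    with tarfile.open(tarball, "w:gz") as tar:
        for wheel in sorted(wheel_dir.glob("*.whl")):
            tar.add(wheel, arcname=wheel.name)
    return tarball


def uv_offline_install_command(package: str, wheel_dir: str) -> list[str]:
    """Sandbox side: install from the unpacked wheels with no index access."""
    return ["uv", "pip", "install", "--no-index", "--find-links", wheel_dir, package]
```

For example, `pip_download_command("mini-swe-agent", "1.17.4", Path("wheels"), "manylinux2014_x86_64", "312")` yields the host-side command, where the platform tag and Python version are placeholder values.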

Note: Created a new _util/agentwheel.py file with potential to expand into a more robust shared utility for other Python agents, with improved data structures. Suggested as an improvement once a second Python agent is ported.

Tests

Added test_download.py and test_ensure_wheels_install.py to test sandbox and package setup (no need for API keys)
Extended test_system_explorer.py to support mini-swe-agent
Created examples/terminal_bench_test/ to validate the agents against Terminal-Bench tasks. Challenges were picked based on their consistency on the official leaderboard and their running cost: break-filter-js-from-html, which always failed, and constraints-scheduling, which always passed.

Limitations

No built-in session

mini-swe-agent doesn't natively handle multi-turn conversations. From their docs: "There is no running shell session. Every action is executed as a subprocess, that means every action is independent of the previous ones (it is literally a subprocess.run/os.system/docker exec call)."
[WIP] The current version handles single-turn prompts (--task option) and still needs to be modified for use with multiple attempts. Options include saving the trajectory file with --output and constructing the history manually. This requires further investigation.
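
The independence of actions described in the quoted docs can be demonstrated with a small sketch. The helper `run_action` and the variable name `MINI_DEMO_STATE` are hypothetical; the example assumes a POSIX shell and that the variable is not already set in the parent environment.

```python
import subprocess


def run_action(command: str) -> str:
    """Run one shell command in a fresh subprocess and return its stdout."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# The variable exported by the first action is gone by the second one,
# because each call starts a brand-new shell process.
run_action("export MINI_DEMO_STATE=kept")
second = run_action('echo "${MINI_DEMO_STATE:-lost}"')
```

Here `second` holds the fallback value rather than `kept`, which is why any multi-turn support has to carry state (for example a saved trajectory) outside the shell.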

Testing

Only tested with gpt-5-mini; local models still need to be supported and tested.
Extend the new tests to the existing agents (claude_code and codex_cli) and improve code reusability.

Next steps & Open questions

Feedback from this PR
Extend testing - Could add more tests from existing evals to validate agent outcome (not just agent infrastructure) and expand to existing agents
Working on the limitations
Potential to simplify common control structures (needs more investigation, considered out of scope for now) and simplify some of the longer functions.

celiawaggoner · 2026-01-18T21:20:04Z

src/inspect_swe/_mini_swe_agent/mini_swe_agent.py

+                cmd.extend(["--cost-limit", str(cost_limit)])
+
+            # build user prompt
+            prompt, has_assistant_response = build_user_prompt(state.messages)


is has_assistant_response ever used / needed?

@celiawaggoner not in this case (neither is the while True structure) because of the limitation of the single-turn conversation (in the PR description). I left it mostly to keep the structure of the other agents and to then extend to the multi-turn version should you think that is part of the scope for this project. Otherwise I'll clean this function up a bit.

Gotcha, thanks for the context Ana! Let's leave this thread open and we can get input from this repo's maintainers on whether they have a preference between including the multi-turn version in this PR or doing it as a follow-up.

src/inspect_swe/_mini_swe_agent/mini_swe_agent.py

src/inspect_swe/_util/agentwheel.py

tests/test_download.py

tests/test_ensure_wheels_install.py

src/inspect_swe/_mini_swe_agent/mini_swe_agent.py

Jay-Bailey

The big thing is the error I'm getting around Python 3 - unsure why. I would call that just a TB2 issue and say maybe mini-swe-agent doesn't have to be compatible with it, but since TB2 specifically was added and presumably works for you, this should be addressed before we merge it.

Jay-Bailey · 2026-01-23T06:29:00Z

examples/terminal_bench_test/task.py

+
+
+@task
+def terminal_bench_task(


When running this with uv run --no-sync inspect eval examples/terminal_bench_test --limit 1 (no-sync needed to prevent aisitools from trying to install)

I get:

/home/ec2-user/inspect_swe/src/inspect_swe/_util/agentwheel.py:140 in detect_python_version │
│ │
│ 137 │ # Detects Python version in sandbox. │
│ 138 │ result = await sandbox.exec(bash_command("python3 --version"), user=user) │
│ 139 │ if not result.success: │
│ ❱ 140 │ │ raise RuntimeError("Python 3 not found in sandbox (required for agent)") │
│ 141 │ │
│ 142 │ # Parse "Python 3.12.0" -> "312" │
│ 143 │ match = re.search(r"Python (\d+).(\d+)", result.stdout) │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Python 3 not found in sandbox (required for agent)
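
For reference, the parsing step visible in the traceback can be sketched in isolation as below. The function name `parse_python_tag` is hypothetical and this is not the code from agentwheel.py; unlike the pattern shown in the traceback, the sketch escapes the dot so it matches a literal "." only.

```python
import re


def parse_python_tag(version_output: str) -> str:
    """Turn `python3 --version` output such as "Python 3.12.0" into "312"."""
    # The dot is escaped so it matches a literal "." rather than any character.
    match = re.search(r"Python (\d+)\.(\d+)", version_output)
    if match is None:
        raise RuntimeError(
            f"Could not parse Python version from {version_output!r}"
        )
    return f"{match.group(1)}{match.group(2)}"
```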

@Jay-Bailey The code assumed Python was installed in the sandbox, which was not the case for the particular task that --limit 1 pulled. I'm now working on changes to use uv in the sandbox instead, to avoid this issue for sandboxes with no Python (should have it done by tomorrow).

In the meantime, running your command with -T eval_names='["break-filter-js-from-html"]' should work (this particular task has Python installed).

I can confirm this command now runs, even without the --T.

Jay-Bailey · 2026-01-23T06:30:27Z

src/inspect_swe/_mini_swe_agent/mini_swe_agent.py

+            attempt_count = 0
+
+            while True:  # Kept for consistency with other agents but currently only single attempt supported
+                # TODO: build command with task. This only works for single-turn tasks currently. Need to update to support multi-turn (perhaps via file output option)


Is this still a todo?

@Jay-Bailey yes - see Limitations in the PR description. I'm still waiting to know whether this should be part of the scope for this PR or done separately.

Jay-Bailey · 2026-01-23T06:30:46Z

src/inspect_swe/_mini_swe_agent/mini_swe_agent.py

+                        if model is not None
+                        else get_model().name,
+                        "OPENAI_API_BASE": f"http://localhost:{bridge.port}/v1",
+                        "OPENAI_API_KEY": "sk-none",


Is this intended as the key?

It is just there so the agent detects a key; the real one will be passed by Inspect.
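
The placeholder-key idea can be sketched as follows. The helper name `bridge_env` is hypothetical; only the two variables shown in the diff are included, and the port is whatever the bridge reports.

```python
def bridge_env(port: int) -> dict[str, str]:
    """Environment that points an OpenAI-compatible client at a local bridge."""
    return {
        "OPENAI_API_BASE": f"http://localhost:{port}/v1",
        # Placeholder only: it satisfies the client's "is a key set?" check.
        "OPENAI_API_KEY": "sk-none",
    }
```

Because requests go to the local bridge rather than directly to the provider, no real credential needs to be present inside the sandbox.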

src/inspect_swe/_mini_swe_agent/mini_swe_agent.py

tests/conftest.py

tests/test_download.py

anaoaktree · 2026-01-30T13:26:20Z

FYI Some codex-cli tests seem to be failing

Jay-Bailey

This now looks good and the command runs. I am not sure what my approval does to your ability to merge, but I trust that Meridian has a setup they're happy with around it. I am happy with the PR. Thanks for your patience!

Jay-Bailey · 2026-02-05T00:43:06Z

examples/terminal_bench_test/task.py

+
+
+@task
+def terminal_bench_task(


I can confirm this command now runs, even without the --T.

celiawaggoner · 2026-02-05T01:05:55Z

@jjallaire this is ready for review - can someone from the Meridian team take a look? There are also a couple of open questions about scope mentioned in the PR description.

jjallaire · 2026-02-06T13:07:41Z

Hi @celiawaggoner, fantastic to see this ready to go! I will review this either today or early next week.

A couple of things to put on your radar pre-review:

Noted the limitation about multi-turn. We definitely want to support multi-turn as well as resume semantics as we do in the other agents.
Does mini-swe-agent have any built-in compaction? If not, you could easily add this by taking a compaction strategy and passing it on to the bridge (which actually takes care of all of the mechanics of compaction).

celiawaggoner · 2026-02-06T19:47:29Z

Thanks @jjallaire! @anaoaktree can you take a look at JJ's comments / questions and work on the updates?

anaoaktree added 5 commits January 15, 2026 16:07

registry setup and mini-swe setup in sandbox with abstractions for pi…

60822b6

…p packages and download/install tests.

minor improvements to package install and agent setup.

e97579b

[WIP] mini-swe-agent v0 and test system explorer/terminal bench chall…

d751e45

…enge.

fixed model and env vars for agent. updated terminal bench challenges

8e08ab1

Merge remote-tracking branch 'upstream/main' into mini-swe-agent

38f4c5b

anaoaktree marked this pull request as draft January 16, 2026 16:31

celiawaggoner reviewed Jan 18, 2026

better exception catching and fixed patern matching

07ae50a

jrh-mann reviewed Jan 22, 2026

anaoaktree added 4 commits January 22, 2026 09:14

fixed sandbox duplicate call, unused var and get model name

8cfbb6f

improved python versioning and swe-mini version detection tests

5f8dfe6

improved cache testing, fixtures and other small fixes

6c956b6

mypy fixes

d93db8e

Jay-Bailey suggested changes Jan 23, 2026

refactored pip -> uv plus small tweaks

1c4b09d

anaoaktree changed the title ~~[DRAFT] Port mini-swe-agent~~ Port mini-swe-agent Jan 30, 2026

anaoaktree added 2 commits January 30, 2026 11:04

Merge remote-tracking branch 'upstream/main' into mini-swe-agent

f56c47e

removed unused sandbox python detection

cbeea66

anaoaktree marked this pull request as ready for review January 30, 2026 13:25

anaoaktree requested a review from Jay-Bailey January 30, 2026 13:25

Jay-Bailey approved these changes Feb 5, 2026

5 participants
