@anaoaktree anaoaktree commented Jan 16, 2026

Summary

Add mini-swe-agent (https://github.com/SWE-agent/mini-swe-agent) as a third agent integration for inspect-swe (github.com/meridianlabs-ai/inspect_swe), alongside claude_code and codex_cli. Note that mini-swe-agent is Python-based, which required different considerations from the existing binary-executable agents.

Important: mini-swe-agent will have a v2.0 with breaking changes (https://mini-swe-agent.com/latest/advanced/v2_migration/). No migration guide at the time of writing.

Closes #18

[UPDATE] Results over one run for each task

Compared with the results that Terminal-Bench (TB) reports for gpt-5-mini.

Random five challenges

uv run --no-sync inspect eval examples/terminal_bench_test --limit 5 --model openai/gpt-5-mini
Screenshot 2026-01-30 at 13 22 46

Results in line with TB

Four pass challenges

uv run --no-sync inspect eval examples/terminal_bench_test --model openai/gpt-5-mini  -T eval_names='["constraints-scheduling", "fix-git", "git-multibranch", "prove-plus-comm"]'
Screenshot 2026-01-30 at 13 20 54

Results in line with TB except for fix-git.

Key changes

mini-swe-agent CLI

  • New _mini_swe_agent/mini_swe_agent.py file that reuses most of the control structures of the existing agents.
  • Currently pinned to 1.17.4, the latest stable v1 release; the v2.0 release is being monitored.
  • Consider adding version validation to warn if the user requests v2.x (once released).
# --yolo               run without confirmations
# --exit-immediately   exit when done (non-interactive)
# --cost-limit N       optional cost limit
# --task "PROMPT"      the task to execute
mini --yolo --exit-immediately --cost-limit N --task "PROMPT"
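
A minimal Python sketch of how the invocation above and the suggested version warning could be assembled. The function names (build_mini_command, warn_if_v2) and the PINNED_VERSION constant are hypothetical and not taken from the PR; only the flags and the 1.17.4 pin come from the description above.

```python
import warnings
from typing import List, Optional

# Version pin mentioned in the PR description.
PINNED_VERSION = "1.17.4"


def warn_if_v2(version: str) -> bool:
    """Warn and return True when the requested major version is 2 or higher.

    Hypothetical helper: assumes a plain "MAJOR.MINOR.PATCH" style string,
    optionally prefixed with "v".
    """
    major = int(version.lstrip("v").split(".", 1)[0])
    if major >= 2:
        warnings.warn(
            f"mini-swe-agent {version} requested; this integration targets "
            f"v1 (pinned to {PINNED_VERSION}) and v2 has breaking changes."
        )
        return True
    return False


def build_mini_command(prompt: str, cost_limit: Optional[float] = None) -> List[str]:
    """Build the non-interactive `mini` argument list for a single-turn task."""
    cmd = ["mini", "--yolo", "--exit-immediately"]
    if cost_limit is not None:
        # The cost limit is optional, so the flag is only added when set.
        cmd.extend(["--cost-limit", str(cost_limit)])
    cmd.extend(["--task", prompt])
    return cmd
```

Returning an argument list rather than a shell string avoids quoting problems when the prompt contains spaces or quotes.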

Package Installation Pattern

pip wheel download on host → tarball → uv offline install in sandbox

  • Uses pip to download wheels on the host and uv to install them in the sandbox, which also covers sandboxes without Python.
  • Follows the same host-download/sandbox-install pattern as the binary agents, mirroring some of the existing utilities but for wheels.
  • Detects the platform and Python version; pip manages version resolution.

Note: Created a new _util/agentwheel.py file that could be expanded into a more robust shared utility for other Python agents, with improved data structures. Suggested as an improvement once a second Python agent is ported.
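
The download → tarball → offline install pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of _util/agentwheel.py: the function names are hypothetical, the platform and Python tags are passed in by the caller, and the uv command leaves out how the target environment is selected.

```python
import tarfile
from pathlib import Path
from typing import List


def pip_download_command(
    package: str, version: str, dest: str, platform_tag: str, python_tag: str
) -> List[str]:
    """Host side: ask pip for wheels matching the sandbox platform and Python."""
    return [
        "pip", "download", f"{package}=={version}",
        "--dest", dest,
        "--only-binary=:all:",  # pip requires this when --platform is given
        "--platform", platform_tag,
        "--python-version", python_tag,
    ]


def bundle_wheels(wheel_dir: Path, tarball: Path) -> Path:
    """Host side: pack every downloaded wheel into one gzip tarball."""
    with tarfile.open(tarball, "w:gz") as tar:
        for wheel in sorted(wheel_dir.glob("*.whl")):
            # Store wheels flat so they unpack into a single directory.
            tar.add(wheel, arcname=wheel.name)
    return tarball


def uv_offline_install_command(package: str, wheel_dir: str) -> List[str]:
    """Sandbox side: install from the unpacked wheels with no network access."""
    return [
        "uv", "pip", "install",
        "--offline", "--no-index",
        "--find-links", wheel_dir,
        package,
    ]
```

Keeping the commands as argument lists makes them easy to assert on in tests that need neither network access nor API keys.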

Tests

  • Added test_download.py and test_ensure_wheels_install.py to test sandbox and package setup (no need for API keys)
  • Extended test_system_explorer.py to support mini-swe-agent
  • Created examples/terminal_bench_test/ to run Terminal-Bench tasks with the agents for validation. Challenges were picked for their consistency on the official leaderboard and their running cost: break-filter-js-from-html, which always failed, and constraints-scheduling, which always passed.

Limitations

No built-in session

  • mini-swe-agent doesn't natively handle multi-turn conversations. From their docs: "There is no running shell session. Every action is executed as a subprocess, that means every action is independent of the previous ones (it is literally a subprocess.run/os.system/docker exec call)."
  • [WIP] The current version handles single-turn prompts (--task option) and is to be modified for use with multiple attempts. Options include saving the trajectory file to --output and constructing the history manually. Requires further investigation.
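
One way the "constructing history manually" option could look, as a rough sketch only: earlier turns are flattened into the text passed to --task. The flatten_history helper and the (role, text) turn format are hypothetical and are not part of mini-swe-agent or of this PR.

```python
from typing import List, Tuple


def flatten_history(turns: List[Tuple[str, str]], next_prompt: str) -> str:
    """Fold earlier (role, text) turns into one single-turn task prompt."""
    if not turns:
        # First attempt: the task is just the prompt itself.
        return next_prompt
    lines = ["Previous conversation:"]
    for role, text in turns:
        lines.append(f"[{role}] {text}")
    lines.append("")  # blank line between the history and the new request
    lines.append(next_prompt)
    return "\n".join(lines)
```

This keeps the agent invocation single-turn while still giving a later attempt access to what happened before; whether it is preferable to a trajectory-file approach is one of the open questions above.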

Testing

  • Only tested with gpt-5-mini; local models still need to be supported and tested.
  • Extend the new tests to the existing agents (claude_code and codex_cli) and improve code reusability.

Next steps & Open questions

  • Feedback from this PR
  • Extend testing - Could add more tests from existing evals to validate agent outcome (not just agent infrastructure) and expand to existing agents
  • Working on the limitations
  • Potential to simplify common control structures (needs more investigation, considered out of scope for now) and simplify some of the longer functions.

@anaoaktree anaoaktree marked this pull request as draft January 16, 2026 16:31
cmd.extend(["--cost-limit", str(cost_limit)])

# build user prompt
prompt, has_assistant_response = build_user_prompt(state.messages)

is has_assistant_response ever used / needed?

Author

@celiawaggoner not in this case (neither is the while True structure) because of the limitation of the single-turn conversation (in the PR description). I left it mostly to keep the structure of the other agents and to then extend to the multi-turn version should you think that is part of the scope for this project. Otherwise I'll clean this function up a bit.

Gotcha, thanks for the context Ana! Let's leave this thread open and we can get input from this repo's maintainers on whether they have a preference between including the multi-turn version in this PR or as a follow-up.

@Jay-Bailey Jay-Bailey left a comment

The big thing is the error I'm getting around Python 3 - unsure why. I would call that just a TB2 issue and say maybe mini-swe-agent doesn't have to be compatible with it, but since TB2 specifically was added and presumably works for you, this should be addressed before we merge it.



@task
def terminal_bench_task(

When running this with uv run --no-sync inspect eval examples/terminal_bench_test --limit 1 (no-sync needed to prevent aisitools from trying to install)

I get:

/home/ec2-user/inspect_swe/src/inspect_swe/_util/agentwheel.py:140 in detect_python_version │
│ │
│ 137 │ # Detects Python version in sandbox. │
│ 138 │ result = await sandbox.exec(bash_command("python3 --version"), user=user) │
│ 139 │ if not result.success: │
│ ❱ 140 │ │ raise RuntimeError("Python 3 not found in sandbox (required for agent)") │
│ 141 │ │
│ 142 │ # Parse "Python 3.12.0" -> "312" │
│ 143 │ match = re.search(r"Python (\d+).(\d+)", result.stdout) │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Python 3 not found in sandbox (required for agent)
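
For reference, the version-parsing step visible in the traceback can be sketched as a standalone function. This is an illustrative variant, not the code in agentwheel.py: the name python_tag is hypothetical, and the dot in the pattern is escaped here so that it only matches a literal "." (in the snippet above the unescaped dot matches any character).

```python
import re
from typing import Optional


def python_tag(version_output: str) -> Optional[str]:
    """Turn `python3 --version` output such as "Python 3.12.0" into "312".

    Returns None when the output does not contain a Python version.
    """
    match = re.search(r"Python (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    # Concatenate major and minor, dropping the patch component.
    return f"{match.group(1)}{match.group(2)}"
```

Returning None rather than raising lets the caller decide whether a missing interpreter is fatal or whether to fall back to another install path.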

Author

@Jay-Bailey The code assumed Python was installed in the sandbox, which didn't hold for the particular task that --limit 1 pulled. I'm now working on changes to use uv in the sandbox instead, to avoid this issue for sandboxes with no Python (should have it done by tomorrow).

In the meantime, running your command with -T eval_names='["break-filter-js-from-html"]' should work (this particular task has Python installed).

I can confirm this command now runs, even without the -T option.

attempt_count = 0

while True: # Kept for consistency with other agents but currently only single attempt supported
# TODO: build command with task. This only works for single-turn tasks currently. Need to update to support multi-turn (perhaps via file output option)

Is this still a todo?

Author

@Jay-Bailey yes - see Limitations in the PR description. I'm still waiting to know whether this should be part of the scope for this PR or done separately.

if model is not None
else get_model().name,
"OPENAI_API_BASE": f"http://localhost:{bridge.port}/v1",
"OPENAI_API_KEY": "sk-none",

Is this intended as the key?

Author

It is just a placeholder so the agent detects a key; the real one will be passed by Inspect.
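
A small sketch of the idea discussed in this thread: the agent process is pointed at the local bridge and given a dummy key so that its own key check passes. The bridge_env function and the PLACEHOLDER_API_KEY constant are hypothetical names; the two environment variables and the "sk-none" value are the ones shown in the diff above.

```python
from typing import Dict

# Dummy value: the agent only needs to see that a key is set.
PLACEHOLDER_API_KEY = "sk-none"


def bridge_env(port: int) -> Dict[str, str]:
    """Environment that routes the agent's OpenAI-style calls to the local bridge."""
    return {
        "OPENAI_API_BASE": f"http://localhost:{port}/v1",
        "OPENAI_API_KEY": PLACEHOLDER_API_KEY,
    }
```

Naming the placeholder makes it clearer to readers of the code that no real credential is being written into the sandbox environment.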

@anaoaktree anaoaktree changed the title [DRAFT] Port mini-swe-agent Port mini-swe-agent Jan 30, 2026
@anaoaktree anaoaktree marked this pull request as ready for review January 30, 2026 13:25
@anaoaktree anaoaktree requested a review from Jay-Bailey January 30, 2026 13:25
@anaoaktree
Author

FYI Some codex-cli tests seem to be failing

@Jay-Bailey Jay-Bailey left a comment

This now looks good and the command runs. I am not sure what my approval does to your ability to merge, but I trust that Meridian has a setup they're happy with around it. I am happy with the PR. Thanks for your patience!




@celiawaggoner

@jjallaire this is ready for review - can someone from the Meridian team take a look? There are also a couple of open questions about scope mentioned in the PR description.

@jjallaire
Contributor

Hi @celiawaggoner, fantastic to see this ready to go! I will review this either today or early next week.

A couple of things to put on your radar pre-review:

  1. Noted the limitation about multi-turn. We definitely want to support multi-turn as well as resume semantics as we do in the other agents.

  2. Does mini-swe-agent have any built-in compaction? If not, you could easily add this by taking a compaction strategy and passing it on to the bridge (which actually takes care of all of the mechanics of compaction).

@celiawaggoner

Thanks @jjallaire! @anaoaktree can you take a look at JJ's comments / questions and work on the updates?

Linked issue (may be closed when this pull request merges): Inspect-native port of SWE Agent