metaharness is an open source Python library for optimizing executable harnesses around agentic coding systems.
It is an unofficial open source implementation of the core ideas in the Meta Harness paper.
The current benchmark evidence in this repository is centered on the Codex CLI path, including hosted Codex and Codex over local Ollama models.
Codex is the primary and validated backend in this repository today.
Gemini CLI is the only additional experimental integration.
It is built for teams who want to improve the code and files around an agent workflow, not just the prompt. That includes instruction files, setup flows, validation scripts, test scripts, routing logic, and other executable support code.
Many agent failures come from the harness around the model:
- weak repository instructions
- missing setup steps
- broken validation logic
- incomplete test flows
- poor iteration memory
- acceptance checks that do not match the real task
metaharness turns those artifacts into a repeatable optimization target with stored evidence for every proposal.
It also captures a compact environment snapshot before each proposal so agents do not waste early turns on basic workspace discovery.
Projects can also declare an allowed write scope so off-target edits are rejected automatically.
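As a sketch, a write scope might be declared in the project's `metaharness.json`. The `allowed_write_paths` key is the documented option name; the surrounding structure and the example paths here are assumptions for illustration:

```json
{
  "allowed_write_paths": [
    "AGENTS.md",
    "scripts/",
    "tests/"
  ]
}
```

With a scope like this in place, a proposal that edits files outside the listed paths is rejected as a scope violation instead of being evaluated.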
metaharness runs an outer optimization loop around a harness:
- start from a baseline workspace
- ask a coding agent to improve it
- validate and evaluate the result
- keep the best candidate
- store all artifacts on disk
The result is a practical, inspectable workflow for improving real harnesses instead of ad hoc prompt tinkering.
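The loop above can be sketched in plain Python. This is a simplified illustration, not the library's actual API: `propose`, `validate`, and `score` are hypothetical stand-ins for the agent call, the validation scripts, and the evaluation step.

```python
import shutil
import tempfile
from pathlib import Path


def optimize(baseline: Path, budget: int, propose, validate, score):
    """Toy outer loop: copy the current best workspace, ask an agent to
    edit it, and keep the highest-scoring candidate that validates."""
    best_dir, best_score = baseline, score(baseline)
    for i in range(budget):
        candidate = Path(tempfile.mkdtemp(prefix=f"cand-{i}-"))
        shutil.copytree(best_dir, candidate, dirs_exist_ok=True)
        propose(candidate)           # agent edits the workspace in place
        if not validate(candidate):  # broken candidates are discarded
            continue
        s = score(candidate)
        if s > best_score:           # keep the best candidate so far
            best_dir, best_score = candidate, s
    return best_dir, best_score
```

The real library additionally persists prompts, diffs, and per-candidate outcomes to disk at each step; the sketch only shows the keep-the-best control flow.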
metaharness is aimed at:

- developers building agentic coding systems who want to optimize harness code, workflow scripts, retrieval wrappers, routing, and evaluation flows
- practitioners using coding-agent tools who want to improve `AGENTS.md`, `GEMINI.md`, bootstrap scripts, validation scripts, and acceptance tests
Install the published CLI from PyPI:
```shell
uv tool install superagentic-metaharness
```

Check the command:

```shell
metaharness --help
```

If you want to run the built-in examples in this repository, use a source checkout:

```shell
uv sync
```

Run the fake backend on a real benchmark:

```shell
uv run metaharness run \
  examples/python_fixture_benchmark \
  --backend fake \
  --budget 1 \
  --run-name quickstart
```

Inspect the run:

```shell
uv run metaharness inspect \
  examples/python_fixture_benchmark/runs/quickstart
```

Export the candidate ledger:

```shell
uv run metaharness ledger \
  examples/python_fixture_benchmark/runs/quickstart \
  --tsv
```

Run a saved experiment matrix:

```shell
uv run metaharness experiment \
  --config examples/experiment_configs/fake-benchmarks.json
```

The library provides:

- a minimal optimization engine
- a filesystem-backed run store
- automatic environment bootstrap snapshots for each proposal
- optional write-scope enforcement through `allowed_write_paths`
- a provider-neutral proposer backend interface
- a real `CodexExecBackend`
- an experimental `GeminiCliBackend`
- a deterministic `FakeBackend`
- extension backends via `backend_plugins` (`module:callable` factories)
- a coding-tool integration for instruction files and script-based harnesses
- explicit per-candidate outcomes: `keep`, `discard`, `crash`, `timeout`, `no-change`, and `scope-violation`
- reporting commands for `inspect`, `ledger`, `summarize`, and `compare`
- trace evidence injection with `--trace-evidence` for HALO/RLM analysis reports
- AHE-style change manifests and attribution reports for evidence-backed candidate edits
- experiment-matrix execution with JSON and TSV outputs
- benchmark targets and experiment records
The repository currently includes:
- two real coding-tool benchmark targets
- a smaller deterministic ticket-router example
- hosted Codex runs on the real benchmarks
- local Codex over Ollama runs with `gpt-oss:20b` and `gpt-oss:120b`
- a docs site published from GitHub Actions
Current documented experiments in this repository show:
- hosted Codex solves both real benchmarks in one proposal iteration
- local `gpt-oss:120b` solves `python_fixture_benchmark`
- local `gpt-oss:20b` is useful for smoke checks but timed out on the current real benchmark runs
Detailed experiment records:
- Codex is the main validated harness path in this repository today
- hosted Codex is the strongest current path for real runs
- local Codex over Ollama works and has been exercised with `gpt-oss:20b` and `gpt-oss:120b`
- Gemini is implemented as an experimental backend and is not part of the main validated release path
All real provider results currently documented in this repository were produced through the Codex CLI path.
That includes both hosted Codex runs and local Ollama runs driven through Codex with gpt-oss models.
Other coding-agent evaluations in the wider ecosystem often emphasize Claude Code and Opus, but this repository's current benchmark evidence is Codex-first.
- Project documentation
- Getting started
- Architecture
- Providers
- Extensions
- Benchmarks
- Official comparison
- Alignment
- CLI reference
- Experiments
Published package:
- PyPI distribution: `superagentic-metaharness`
- CLI command: `metaharness`
- import package: `metaharness`
Install the CLI with uv:
```shell
uv tool install superagentic-metaharness
```

Upgrade it later:

```shell
uv tool upgrade superagentic-metaharness
```

Install it into a Python project dependency set:

```shell
uv add superagentic-metaharness
```

Install with pip:

```shell
pip install superagentic-metaharness
```

Source checkout setup:

```shell
uv sync
```

If you want the docs toolchain too:

```shell
uv sync --group dev
```

Check the CLI:

```shell
uv run metaharness --help
```

Editable install with pip also works:

```shell
pip install -e .
```

Requirements:

- `codex` CLI installed
- authenticated Codex session or API key
- outbound network access
Run a real benchmark with hosted Codex:
```shell
uv run metaharness run \
  examples/python_fixture_benchmark \
  --backend codex \
  --hosted \
  --budget 1 \
  --run-name hosted-codex
```

Important:

- use `--hosted` when a project config defaults to local Ollama
- the library is ready for hosted Codex runs today
Probe the local setup:
```shell
uv run metaharness smoke codex \
  examples/python_fixture_benchmark \
  --probe-only \
  --oss \
  --local-provider ollama \
  --model gpt-oss:20b
```

Run with `gpt-oss:20b`:

```shell
uv run metaharness run \
  examples/python_fixture_benchmark \
  --backend codex \
  --oss \
  --local-provider ollama \
  --model gpt-oss:20b \
  --proposal-timeout 240 \
  --budget 1 \
  --run-name ollama-20b
```

Run with `gpt-oss:120b`:

```shell
uv run metaharness run \
  examples/python_fixture_benchmark \
  --backend codex \
  --oss \
  --local-provider ollama \
  --model gpt-oss:120b \
  --proposal-timeout 420 \
  --budget 1 \
  --run-name ollama-120b
```

Real benchmarks:
Smaller deterministic example:
Run the ticket router example:
```shell
uv run python examples/ticket_router/run.py --backend fake --budget 1
```

Create a coding-tool project:

```shell
uv run metaharness scaffold coding-tool ./my-coding-tool-optimizer
```

Available profiles:

- `standard`
- `local-oss-smoke`
- `local-oss-medium`

Run the scaffold with the fake backend:

```shell
uv run metaharness run ./my-coding-tool-optimizer --backend fake --budget 1
```

You can register custom closed-source or internal harness adapters without editing metaharness core code.
Add a plugin in `metaharness.json`:

```json
{
  "backend_plugins": {
    "cursor": {
      "factory": "my_harness_plugins.cursor:create_backend",
      "options": {
        "model": "cursor-pro"
      }
    }
  }
}
```

Then run it like any other backend:

```shell
uv run metaharness run ./my-coding-tool-optimizer --backend cursor --budget 1
```

Factory contract:

- reference format: `module:callable`
- the callable receives `project` and `options`
- the callable returns an object with `name`, `prepare`, `invoke`, and `collect`
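Under that contract, a minimal plugin module might look like the sketch below. Only the `module:callable` factory shape, the `project`/`options` arguments, and the `name`/`prepare`/`invoke`/`collect` surface come from the contract; the method signatures, return values, and the `my_harness_plugins.cursor` module name are hypothetical.

```python
# my_harness_plugins/cursor.py (hypothetical plugin module)


class CursorBackend:
    """Toy adapter exposing the documented backend surface."""

    def __init__(self, project, options):
        self.project = project
        self.model = options.get("model", "cursor-pro")
        self.last_prompt = None

    @property
    def name(self):
        return f"cursor ({self.model})"

    def prepare(self, workspace):
        # Set up any per-candidate state before the agent runs.
        self.workspace = workspace

    def invoke(self, prompt):
        # A real adapter would shell out to the external coding tool here;
        # this stub just records the prompt it was given.
        self.last_prompt = prompt
        return "no-op"

    def collect(self):
        # Return whatever artifacts the run store should persist.
        return {"model": self.model, "prompt": self.last_prompt}


def create_backend(project, options):
    # Factory referenced as "my_harness_plugins.cursor:create_backend".
    return CursorBackend(project, options)
```

The factory pattern keeps proprietary tool integrations in your own package while metaharness only ever sees the four-method surface.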
Create an official-style domain onboarding pack:
```shell
uv run metaharness onboard ./my-domain-onboarding
```

This writes:

- `ONBOARDING.md` with a question-driven domain setup flow
- `domain_spec.md` with a structured template for search and evaluation design
Create a scaffold:
```shell
uv run metaharness scaffold coding-tool ./my-project
```

Run a project:

```shell
uv run metaharness run ./my-project --backend fake --budget 1
```

Probe Codex:

```shell
uv run metaharness smoke codex ./my-project --probe-only
```

Inspect a run:

```shell
uv run metaharness inspect ./my-project/runs/example
```

Compare runs:

```shell
uv run metaharness compare \
  ./examples/python_fixture_benchmark/runs/hosted-codex-20260401 \
  ./examples/python_fixture_benchmark/runs/ollama-20b-20260401 \
  ./examples/python_fixture_benchmark/runs/ollama-120b-20260401
```

Run an experiment matrix:

```shell
uv run metaharness experiment --config examples/experiment_configs/fake-benchmarks.json
```

Every run stores:
- prompts
- candidate workspaces
- validation results
- evaluation results
- proposal metadata
- workspace diffs
- per-candidate manifests
That makes the optimization history reviewable, debuggable, and reusable.
Compile checks:
```shell
uv run python -m compileall -q src tests examples docs
```

Unit tests:

```shell
uv run python -m unittest discover -s tests -v
```

Docs build:

```shell
uv run mkdocs build --strict
```

Fake benchmark smoke runs:

```shell
uv run metaharness run examples/python_fixture_benchmark --backend fake --budget 1 --run-name ci-fixture-local
uv run metaharness run examples/python_cli_benchmark --backend fake --budget 1 --run-name ci-cli-local
uv run python examples/ticket_router/run.py --backend fake --budget 1
```

Apache 2.0. See LICENSE.