Loopsmith is an eval and promotion harness for AI agents. It helps you improve agents the way you improve software: test changes, compare outputs, keep only what holds up.
Tagline: Improve agents the way you improve software: define the eval, test the candidate, keep only what survives evidence.
Loopsmith is harness-agnostic. It works with OpenClaw, Hermes, Codex, OpenCode, Claude Code, or any other agent setup that can produce baseline/candidate outputs and read/write repo files.
- compare baseline and candidate agent behaviour with evidence
- turn recurring failures into eval cases instead of complaints
- promote prompt, policy, or evaluator changes only after review
- keep a ledger of why an agent behaviour changed
Loopsmith is not the sprint protocol itself. That is proof-loop.
Proof Loop governs a single task: frozen acceptance criteria, separate verifier, durable verdict artifacts, and no self-certified done claims.
Loopsmith improves repeated agent behaviour over time: baseline vs candidate, eval packs, scoring, promotion/rejection, and a ledger.
Use Proof Loop inside a task. Use Loopsmith when the same failure pattern keeps coming back and the agent, prompt, policy, or evaluator itself needs measurable improvement. Both are intentionally file/protocol based, so they can travel across harnesses instead of depending on one vendor runtime. See docs/proof-loop-relationship.md.
Use Loopsmith when an agent is producing output that is:
- good enough to be dangerous
- repetitive or sludgy
- hard to trust
- hard to review consistently
- drifting after prompt or policy changes
Loopsmith is for cases where taste alone is not enough and blind prompting is not good enough.
It helps answer questions like:
- Is this candidate actually better than the baseline?
- Did we improve the output or just rewrite it differently?
- Which failures should block promotion?
- What is live right now, and why?
Loopsmith is not a chatbot wrapper, a benchmark vanity project, or a generic agent platform.
It is not trying to replace judgment. It is trying to make judgment more disciplined.
Each loop compares:
- a baseline
- a candidate
- one or more eval cases
- a verdict and promotion state
A candidate must improve evidence, not just sound clever.
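The comparison above can be sketched as a small data structure plus a verdict rule. This is a minimal illustration, not Loopsmith's actual schema: `CaseResult`, `verdict`, the `min_gain` threshold, and the state names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Scores for one eval case; all field names here are hypothetical."""
    case_id: str
    baseline_score: float   # score of the current behaviour, 0.0 to 1.0
    candidate_score: float  # score of the proposed change, 0.0 to 1.0
    blocking: bool = False  # a regression on this case vetoes promotion


def verdict(results: list[CaseResult], min_gain: float = 0.05) -> str:
    """Return a promotion state for a candidate (state names are assumed)."""
    if not results:
        return "needs_review"  # no evidence, so nothing can be promoted
    if any(r.blocking and r.candidate_score < r.baseline_score for r in results):
        return "rejected"
    mean_gain = sum(r.candidate_score - r.baseline_score for r in results) / len(results)
    return "eligible" if mean_gain >= min_gain else "needs_review"
```

In this sketch `"eligible"` only means the evidence is good enough to put in front of a reviewer; promotion itself would still go through human approval.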
See examples/README.md for the example index, or jump straight to examples/before-after-eval.md. It shows how a recurring research-brief quality problem becomes a baseline-vs-candidate eval with review artifacts.
A research agent can be factually competent but still painful to read. The brief may repeat the same thesis across sections, keep weak topics alive, and bury the useful signal under repetitive scaffolding.
Loopsmith can treat that as a bounded quality problem:
- baseline = current research brief policy
- candidate = shorter, sharper signal-density policy
- eval = anti-sludge, anti-repetition, weak-topic-drop checks
- promotion = only after the candidate clearly beats the baseline
See:
- docs/research-brief-quality-pack.md
- candidates/scout/research-policy-v3.md
- candidates/scout/candidate-signal-density.json
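One way an anti-repetition check could work is to measure how many sentences in a brief nearly duplicate an earlier sentence. The sketch below is an assumption about what such a check might look like, not the evaluator shipped in the repo; `repetition_ratio` and the 0.8 Jaccard threshold are hypothetical.

```python
import re


def _words(sentence: str) -> frozenset[str]:
    """Lowercased word set of a sentence, ignoring punctuation."""
    return frozenset(re.findall(r"[a-z0-9']+", sentence.lower()))


def repetition_ratio(text: str, threshold: float = 0.8) -> float:
    """Fraction of sentences whose word set overlaps an earlier sentence.

    Overlap is Jaccard similarity (shared words / all words); a sentence
    counts as a repeat when similarity reaches `threshold`.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    seen: list[frozenset[str]] = []
    repeats = 0
    for sentence in sentences:
        words = _words(sentence)
        if not words:
            continue  # skip fragments with no word characters
        if any(len(words & prev) / len(words | prev) >= threshold for prev in seen):
            repeats += 1
        seen.append(words)
    return repeats / len(seen) if seen else 0.0
```

A baseline brief and a candidate brief could each be scored this way, with the candidate expected to come in lower. Word-set overlap is a crude proxy: it will miss paraphrased repetition, so a rubric or human review would still be needed for the thesis-inflation cases.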
Loopsmith is useful for recurring failure modes such as:
- robotic direct-chat replies
- generic research sludge
- false completion claims
- vague QA verdicts
- proof without proof
- cumulative regression dishonesty
- repetitive scaffolding that hides weak signal
Use this repo when a failure pattern keeps coming back and needs to become an eval instead of another complaint. Loopsmith compares baseline and candidate behaviour, writes review artifacts, and promotes only what survives evidence.
Use the neighbouring tools at different points in the workflow:
| Need | Use |
|---|---|
| Turn a fuzzy request into an executable agent brief | Brief Master |
| Prove one coding task is actually done | Proof Loop |
| Improve repeated agent behaviour with evals | Loopsmith |
| Keep source-backed memory for long-running agents | Sovereign Brain |
| Stop frontend agents producing generic UI sludge | no-slop-ui |
A practical chain looks like this: messy request -> Brief Master brief -> Proof Loop task -> Loopsmith eval if the same failure keeps recurring -> Sovereign Brain records the durable decision.
- Proof Loop - task-level completion protocol. Loopsmith is the next step when the same proof failure keeps recurring.
- Sovereign Brain - source-backed memory and review workflow for long-running agents; useful context for eval decisions and agent behaviour history.
- Brief Master - improves the briefs that become eval inputs, candidate policies, or Proof Loop specs.
- agents/: agent profiles
- evals/: agent and shared eval pack definitions
- baseline/: current baseline outputs or fixtures
- candidates/: candidate variants under test
- promoted/: promoted candidate manifests
- rejected/: rejected candidate manifests
- ledger/: promotion history
- policies/: mutation boundaries and promotion rules
- runs/: generated run logs, summaries, review queue, promotion index, and provenance views
- src/: schemas, scoring, loaders, runner, CLI, summaries, operator views
- docs/: design notes, usage, review flow, artifact policy, evaluator strategy, Proof Loop relationship, shared-pack guidance, sanitisation notes, and pack patterns
python3 src/cli.py run --agent conductor
python3 src/cli.py run --agent scout --json
python3 src/cli.py run --agent iris
python3 src/cli.py run --agent rex
python3 src/cli.py run-shared --pack golden:anti-bullshit
python3 src/cli.py promote --agent conductor --candidate candidate-001 --approved-by reviewer
Loopsmith can improve research agents that are technically competent but operationally dull to read.
A research brief quality pack can encode recurring failure modes like:
- repeated thesis inflation
- template fatigue across sections
- weak-topic retention
- reader-specific scaffolding bloat
- fake completeness instead of signal density
See docs/research-brief-quality-pack.md for the public pattern.
Once a repo has multiple packs and promotion states, the human operator needs a clean control surface. Loopsmith generates:
- pack summaries
- a review queue
- a promotion index
- a baseline provenance view
so a reviewer can quickly see what is eligible, what needs review, what regressed, what is currently live, and where that live state came from.
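As a rough illustration of the review queue idea, candidate records can be grouped by state so each of the reviewer's questions maps to one bucket. The record fields, state names, and `build_review_queue` are hypothetical and do not reflect the files Loopsmith actually generates under runs/.

```python
# Assumed state names, derived from the questions a reviewer asks above.
STATES = ("eligible", "needs_review", "regressed", "live")


def build_review_queue(records: list[dict[str, str]]) -> dict[str, list[str]]:
    """Group candidate records by state, keeping empty buckets visible."""
    queue: dict[str, list[str]] = {state: [] for state in STATES}
    for rec in records:
        state = rec["state"]
        if state not in queue:
            raise ValueError(f"unknown state: {state!r}")
        queue[state].append(f'{rec["agent"]}/{rec["candidate"]}')
    # Sort each bucket so the generated view is stable between runs.
    return {state: sorted(items) for state, items in queue.items()}
```

Keeping empty buckets in the output is a deliberate choice in this sketch: a reviewer can see at a glance that nothing has regressed, rather than having to infer it from a missing key.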
Some failure modes are not agent-local. Shared packs let Loopsmith express cross-agent behavioural families as first-class objects with explicit metadata, participating agents, and clearer operator-facing summaries.
Some cases are too important to judge with loose heuristics alone. Loopsmith supports case-specific evaluators for proof-heavy checks such as:
- Forge proof-before-done
- Iris AC verdict discipline
- Iris review-vs-validation boundary
- Rex cumulative regression honesty
- Rex layered reporting honesty
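A case-specific evaluator can be pictured as a function registered against a case id, with a looser generic check as the fallback. The following is a minimal sketch under stated assumptions: the registry, the `proof-before-done` case id, and the `Evidence:` marker convention are all hypothetical, not the repo's real evaluator interface.

```python
from typing import Callable

Evaluator = Callable[[str], bool]
_EVALUATORS: dict[str, Evaluator] = {}


def evaluator(case_id: str) -> Callable[[Evaluator], Evaluator]:
    """Decorator that registers a strict evaluator for one case id."""
    def register(fn: Evaluator) -> Evaluator:
        _EVALUATORS[case_id] = fn
        return fn
    return register


@evaluator("proof-before-done")
def proof_before_done(output: str) -> bool:
    """Fail any output that claims completion without citing evidence.

    Assumes agents mark proof with an 'Evidence:' line (hypothetical).
    """
    text = output.lower()
    claims_done = "done" in text
    has_proof = "evidence:" in text
    return has_proof or not claims_done


def generic_heuristic(output: str) -> bool:
    """Loose fallback: only checks that the agent produced something."""
    return bool(output.strip())


def evaluate(case_id: str, output: str) -> bool:
    """Use the case-specific evaluator when one exists, else the fallback."""
    return _EVALUATORS.get(case_id, generic_heuristic)(output)
```

For example, `evaluate("proof-before-done", "Done.")` fails in this sketch, while the same text with an `Evidence:` line passes. Cases without a registered evaluator fall through to the generic heuristic.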
Loopsmith shipped a real v1 and is now moving through hardening passes. The public-share cleanup is documented in docs/recovery-pass.md.
- repo skeleton
- initial eval schema
- initial run logging schema
- mutation boundaries
- first loop runner
- 3 strong demo agents
- starter packs for the rest
- public sanitisation
- better scoring (pass_fail, rubric, composite)
- promotion flow with human approval
- file-driven runner + CLI
- stronger Iris and Rex packs
- anti-bullshit golden cases
- pack-level review summaries
- stronger shared-pack review flow
- review queue and promotion index
- case-specific evaluators for proof-heavy cases
- documented evaluator strategy and selective expansion rules
- artifact policy and baseline provenance views
- shared packs as first-class objects with metadata
- reusable research-brief quality pack pattern for anti-sludge and signal-density tuning
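The three scoring modes named in the list above (pass_fail, rubric, composite) could plausibly reduce to functions like these. The sketch is an assumption about their shape only: the function names, the 0 to 5 rating scale, and the weighting scheme are hypothetical, not the definitions in src/.

```python
def score_pass_fail(passed: bool) -> float:
    """Binary check: 1.0 when the case passes, 0.0 otherwise."""
    return 1.0 if passed else 0.0


def score_rubric(ratings: dict[str, int], max_rating: int = 5) -> float:
    """Average per-criterion ratings and normalise to the 0.0 to 1.0 range."""
    if not ratings:
        raise ValueError("rubric needs at least one criterion")
    return sum(ratings.values()) / (len(ratings) * max_rating)


def score_composite(parts: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of named sub-scores; every part needs a weight."""
    total_weight = sum(weights[name] for name in parts)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(score * weights[name] for name, score in parts.items()) / total_weight
```

A composite score might combine a hard pass/fail gate with a softer rubric, for instance `score_composite({"gate": score_pass_fail(True), "rubric": score_rubric({"clarity": 4, "density": 2})}, {"gate": 1.0, "rubric": 1.0})`.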
Loopsmith is currently deepest in these kinds of agent work:
- direct response quality
- research brief quality
- proof-before-done implementation discipline
- review verdict quality
- acceptance and regression reporting honesty
The rest of the repo still ships with lighter starter packs while the core patterns are being hardened.
- No giant-file soup
- Split by concern
- Explicit mutation boundaries
- Human promotion gate for meaningful changes
- Public-safe structure from the start