A regression-testing harness for LLM agents: case files, judges, replay, and a result store that survives across runs.

Building agents is fast. Knowing whether your latest prompt change broke
something is slow. agent-eval-suite (aes) is a small framework that lets
you treat agent behavior the way you treat any other software: with
versioned test cases, deterministic replays, and a CI-friendly pass/fail
exit code.

It is intentionally small (~1800 LoC) and provider-agnostic. You bring an
async function `agent(input) -> response`; aes brings the harness.
- YAML-defined cases with substring, exact-match, JSON-path, and pluggable LLM-judge expectations.
- Async runner with bounded concurrency and per-case timeouts.
- Recorder/Replayer so you can capture a real agent run once and replay it offline — no API costs, no flakiness in CI.
- SQLite result store for tracking pass-rate over time and diffing two runs to surface regressions.
- CLI (`aes run`) that exits non-zero when any case fails; drop it into GitHub Actions and you have agent CI.

What aes is not:

- Not a benchmark leaderboard. There are plenty of those.
- Not an evaluation dataset. You write the cases; `aes` only runs them.
- Not a UI. Use the JSON output, `aes diff`, or pipe into your own dashboard.
```text
┌──────────────────┐
│   cases.yaml     │ ── declarative cases + expectations
└────────┬─────────┘
         │ loader
         ▼
┌──────────────────┐         ┌──────────────────┐
│      Runner      │◀────────│  Your agent fn   │
│ (async, bounded) │         │ async (input)→r  │
└────────┬─────────┘         └──────────────────┘
         │ per-case Verdict from a Judge
         ▼
┌──────────────────┐         ┌──────────────────┐
│    EvalReport    │────────▶│   ResultStore    │
│   (rich/json)    │         │     (sqlite)     │
└──────────────────┘         └──────────────────┘
```
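The "async, bounded" runner in the diagram can be pictured with a semaphore plus a per-call timeout. The sketch below is only an illustration of that pattern, not the aes implementation: `run_bounded`, its parameters, the tuple result shape, and `_demo_agent` are all hypothetical names introduced here.

```python
import asyncio


async def run_bounded(agent, inputs, concurrency=4, timeout_s=30.0):
    """Run `agent` over `inputs`, at most `concurrency` calls at a time.

    Returns (input, response_or_None, error_or_None) tuples in input order.
    """
    sem = asyncio.Semaphore(concurrency)

    async def run_one(item):
        async with sem:  # waits while `concurrency` calls are already in flight
            try:
                out = await asyncio.wait_for(agent(item), timeout=timeout_s)
                return item, out, None
            except asyncio.TimeoutError:
                return item, None, "timeout"
            except Exception as exc:  # one failing case should not abort the run
                return item, None, repr(exc)

    return await asyncio.gather(*(run_one(i) for i in inputs))


async def _demo_agent(query):
    # Stand-in agent: the "slow" input exceeds the timeout used below.
    await asyncio.sleep(0.5 if query == "slow" else 0.0)
    return query.upper()


demo_results = asyncio.run(
    run_bounded(_demo_agent, ["a", "slow", "b"], concurrency=2, timeout_s=0.1)
)
```

Note that the timeout here covers only the agent call itself, not the time a case spends waiting for a free slot; whether aes measures it the same way is not stated in this README.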
```bash
pip install agent-eval-suite          # core
pip install "agent-eval-suite[sql]"   # SQLAlchemy-backed store (optional)
```

```yaml
# cases/smoke.yaml
cases:
  - id: greet-001
    input: "Say hello in French"
    expect:
      contains: ["bonjour"]
    tags: [smoke, i18n]
  - id: math-001
    input: "What is 17 * 23?"
    expect:
      equals: "391"
    tags: [smoke, math]
  - id: json-001
    input: "Return JSON with the user's full name and age."
    expect:
      json_path:
        full_name: "Alice Chen"
        age: 30
    tags: [structured]
```

```python
# my_agent.py
async def agent(query: str) -> str:
    # Your real agent: LLM call, tool loop, whatever
    return await run_my_agent(query)
```

```bash
aes run cases/smoke.yaml --agent my_agent:agent --concurrency 4
```

```text
┏━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Case ID   ┃ Pass ┃ Score  ┃ Duration ┃ Reason                      ┃
┡━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ greet-001 │ ✓    │ 1.00   │ 342ms    │ all substrings present      │
│ math-001  │ ✗    │ 0.00   │ 188ms    │ got "approximately 390"     │
│ json-001  │ ✓    │ 1.00   │ 412ms    │ all json paths match        │
└───────────┴──────┴────────┴──────────┴─────────────────────────────┘
2/3 passed (66.7%)
```
Exit code is 0 when everything passes, 1 otherwise.
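
The summary line and exit code follow directly from the per-case pass flags. A minimal sketch of that arithmetic, assuming a plain list of booleans as input; `summarize` is a hypothetical helper, and treating an empty run as a pass is an assumption rather than documented aes behavior:

```python
def summarize(passed_flags):
    """Return (summary_line, exit_code) for a list of per-case pass flags."""
    total = len(passed_flags)
    passed = sum(passed_flags)
    rate = 100.0 * passed / total if total else 0.0
    line = f"{passed}/{total} passed ({rate:.1f}%)"
    # Assumption: an empty run counts as success (passed == total == 0).
    exit_code = 0 if passed == total else 1
    return line, exit_code


summary_line, exit_code = summarize([True, False, True])
```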
| Judge | Selected automatically when … | Verdict |
|---|---|---|
| `exact` | `expect.equals` is set | string `==` |
| `contains` | `expect.contains` / `not_contains` is set | substring present |
| `jsonpath` | `expect.json_path` is set | parsed JSON equality |
| `llm` | `expect.judge: llm` is set | caller-supplied |
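
To make the selection rule in the table concrete, here is a rough sketch of how an `expect` mapping could be routed to a built-in judge. It is not the aes source. The function names, the case-insensitive substring match, the whitespace stripping for exact match, and the restriction of `json_path` to top-level keys are all assumptions made for illustration, and the `llm` judge is deliberately left out.

```python
import json


def judge_exact(response, expect):
    # Assumption: surrounding whitespace is ignored.
    return response.strip() == expect["equals"]


def judge_contains(response, expect):
    # Assumption: matching is case-insensitive.
    text = response.lower()
    has_all = all(s.lower() in text for s in expect.get("contains", []))
    has_none = not any(s.lower() in text for s in expect.get("not_contains", []))
    return has_all and has_none


def judge_jsonpath(response, expect):
    # Assumption: only top-level keys are compared.
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(k in data and data[k] == v for k, v in expect["json_path"].items())


JUDGES = {"exact": judge_exact, "contains": judge_contains, "jsonpath": judge_jsonpath}


def pick_judge(expect):
    """Return the judge name for an `expect` mapping, following the table's rules."""
    if expect.get("judge") == "llm":
        return "llm"
    if "equals" in expect:
        return "exact"
    if "contains" in expect or "not_contains" in expect:
        return "contains"
    if "json_path" in expect:
        return "jsonpath"
    raise ValueError(f"no judge matches expectation keys: {sorted(expect)}")


def evaluate(response, expect):
    name = pick_judge(expect)
    if name not in JUDGES:
        raise NotImplementedError(f"judge {name!r} is not part of this sketch")
    return JUDGES[name](response, expect)
```

With these assumptions, `evaluate("approximately 390", {"equals": "391"})` is false, which matches the failing `math-001` row in the sample output.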
Want a custom judge?

```python
from agent_eval_suite import Runner
from agent_eval_suite.judge import Verdict


class CosineSimilarity:
    name = "cosine"

    def evaluate(self, response, expect):
        # compute_similarity is a placeholder for your own embedding comparison
        sim = compute_similarity(response, expect.metadata["target"])
        return Verdict(sim > 0.8, sim, f"similarity={sim:.3f}")


runner = Runner(my_agent)  # my_agent: your async agent function
runner.register_judge(CosineSimilarity())
```

Capturing a real run, then replaying it offline:
```python
import asyncio

from agent_eval_suite import Runner
from agent_eval_suite.replay import Recorder

rec = Recorder("recording.json")

async def recording_agent(query):
    out = await real_agent(query)  # real_agent: your live, API-calling agent
    rec.record(query, out)
    return out

# ... run once with the real agent ...
rec.save()

# in CI, replay:
replayer = Recorder.load("recording.json")  # returns a Replayer
asyncio.run(Runner(replayer).run_all(cases))  # cases: loaded from your YAML file
```

This is the difference between "we run our LLM against 200 cases on every PR and burn through API credits" and "we run our recorded outputs and only call the API when intentionally regenerating the recording."
```python
from agent_eval_suite.store import ResultStore

store = ResultStore("history.sqlite")
store.save(run_id="pr-447", results=results)  # results: output of a Runner run

# in nightly job
regressions = store.regressions(
    run_id="nightly-2026-04-30",
    baseline_run_id="nightly-2026-04-29",
)
if regressions:
    print(f"NEW FAILURES: {regressions}")
```

```yaml
# .github/workflows/agent-eval.yml
- run: pip install agent-eval-suite
- run: aes run cases/regression.yaml --agent app.agent:run --output report.json
- uses: actions/upload-artifact@v4
  if: always()  # upload the report even when aes exits non-zero
  with:
    name: eval-report
    path: report.json
```

If aes is useful in your work, the BibTeX:
```bibtex
@misc{agent-eval-suite,
  author = {fragres},
  title  = {agent-eval-suite: regression testing for LLM agents},
  year   = {2025},
  url    = {https://github.com/fragres/agent-eval-suite}
}
```

MIT. See LICENSE.