agent-eval-suite

A regression-testing harness for LLM agents — case files, judges, replay, and a result store that survives across runs.

Overview

Building agents is fast. Knowing whether your latest prompt change broke something is slow. agent-eval-suite (aes) is a small framework that lets you treat agent behavior the way you treat any other software: with versioned test cases, deterministic replays, and a CI-friendly pass/fail exit code.

It is intentionally small (~1800 LoC) and provider-agnostic. You bring an async function agent(input) -> response; aes brings the harness.

What this is

YAML-defined cases with substring, exact-match, JSON-path, and pluggable LLM-judge expectations.
Async runner with bounded concurrency and per-case timeouts.
Recorder/Replayer so you can capture a real agent run once and replay it offline — no API costs, no flakiness in CI.
SQLite result store for tracking pass-rate over time and diffing two runs to surface regressions.
CLI (aes run) that exits non-zero when any case fails — drop it into GitHub Actions and you have agent CI.
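
The bounded concurrency and per-case timeouts listed above can be illustrated with plain asyncio. This is a minimal sketch, not the actual aes runner: `run_bounded`, `demo_agent`, and the choice to return `None` for a timed-out case are assumptions made for illustration.

```python
import asyncio


async def run_bounded(agent, inputs, concurrency=4, timeout_s=30.0):
    """Call agent(item) for every input, with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(item):
        async with semaphore:  # caps the number of concurrent agent calls
            try:
                return await asyncio.wait_for(agent(item), timeout=timeout_s)
            except asyncio.TimeoutError:
                return None  # sketch-only convention: a timed-out case has no response

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(item) for item in inputs))


async def demo_agent(query):
    # Stand-in agent: "slow" sleeps past the timeout, everything else answers quickly.
    await asyncio.sleep(0.5 if query == "slow" else 0.01)
    return query.upper()


results = asyncio.run(
    run_bounded(demo_agent, ["a", "slow", "b"], concurrency=2, timeout_s=0.1)
)
print(results)  # ['A', None, 'B']
```

A semaphore keeps the code short; a worker pool reading from an `asyncio.Queue` would be an equally reasonable design when cases need to be cancelled early.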

What this is not

Not a benchmark leaderboard. There are plenty of those.
Not an evaluation dataset. You write the cases; aes only runs them.
Not a UI. Use the JSON output or aes diff, or pipe the results into your own dashboard.

Architecture

┌──────────────────┐
│  cases.yaml      │  ── declarative cases + expectations
└────────┬─────────┘
         │  loader
         ▼
┌──────────────────┐         ┌──────────────────┐
│   Runner         │◀────────│   Your agent fn  │
│ (async, bounded) │         │ async (input)→r  │
└────────┬─────────┘         └──────────────────┘
         │  per-case Verdict from a Judge
         ▼
┌──────────────────┐         ┌──────────────────┐
│   EvalReport     │────────▶│  ResultStore     │
│  (rich/json)     │         │  (sqlite)        │
└──────────────────┘         └──────────────────┘

Installation

pip install agent-eval-suite          # core
pip install "agent-eval-suite[sql]"   # SQLAlchemy-backed store (optional)

Quick Start

1. Write some cases

# cases/smoke.yaml
cases:
  - id: greet-001
    input: "Say hello in French"
    expect:
      contains: ["bonjour"]
    tags: [smoke, i18n]

  - id: math-001
    input: "What is 17 * 23?"
    expect:
      equals: "391"
    tags: [smoke, math]

  - id: json-001
    input: "Return JSON with the user's full name and age."
    expect:
      json_path:
        full_name: "Alice Chen"
        age: 30
    tags: [structured]

2. Define your agent

# my_agent.py
async def agent(query: str) -> str:
    # Your real agent — LLM call, tool loop, whatever
    return await run_my_agent(query)

3. Run

aes run cases/smoke.yaml --agent my_agent:agent --concurrency 4

┏━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Case ID   ┃ Pass ┃  Score ┃ Duration ┃ Reason                       ┃
┡━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ greet-001 │  ✓   │   1.00 │    342ms │ all substrings present       │
│ math-001  │  ✗   │   0.00 │    188ms │ got "approximately 390"      │
│ json-001  │  ✓   │   1.00 │    412ms │ all json paths match         │
└───────────┴──────┴────────┴──────────┴──────────────────────────────┘
2/3 passed (66.7%)

The exit code is 0 when every case passes and 1 otherwise.
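
The `--agent my_agent:agent` argument in the run command looks like the common `module:attribute` convention. How aes resolves it internally is not documented here; the sketch below shows one way such a spec could be resolved with the standard library. `resolve_agent` is a hypothetical helper, not part of aes.

```python
import importlib


def resolve_agent(spec: str):
    """Turn a 'module:attribute' string into the object it names."""
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    obj = importlib.import_module(module_name)
    for part in attr_path.split("."):  # allows dotted attributes such as 'Cls.method'
        obj = getattr(obj, part)
    return obj


# Demonstrated on the standard library so the snippet runs anywhere.
dumps = resolve_agent("json:dumps")
print(dumps({"ok": True}))  # {"ok": true}
```

With this convention, `app.agent:run` would mean "import the module `app.agent` and use its `run` attribute".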

Judges

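| Judge | Selected automatically when | Verdict is based on |
|---|---|---|
| `exact` | `expect.equals` is set | string equality (`==`) |
| `contains` | `expect.contains` or `expect.not_contains` is set | substring presence (or absence, for `not_contains`) |
| `jsonpath` | `expect.json_path` is set | equality of the parsed JSON values |
| `llm` | `expect.judge: llm` is set | a caller-supplied judge |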
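
To make the table concrete, here is a rough sketch of what the three deterministic judges could look like. It is not the aes source. The `Verdict` stand-in, the function names, case-sensitive matching, and the restriction to top-level JSON keys are all assumptions of this sketch.

```python
import json
from typing import NamedTuple


class Verdict(NamedTuple):
    # Stand-in for the aes Verdict; field names are assumed, order follows this README.
    passed: bool
    score: float
    reason: str


def judge_exact(response: str, expected: str) -> Verdict:
    ok = response == expected
    return Verdict(ok, float(ok), "exact match" if ok else f"got {response!r}")


def judge_contains(response: str, required=(), forbidden=()) -> Verdict:
    # Case-sensitive matching is an assumption of this sketch.
    missing = [s for s in required if s not in response]
    unwanted = [s for s in forbidden if s in response]
    ok = not missing and not unwanted
    reason = "all substring checks passed" if ok else f"missing={missing}, unwanted={unwanted}"
    return Verdict(ok, float(ok), reason)


def judge_json(response: str, expected: dict) -> Verdict:
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        return Verdict(False, 0.0, f"invalid JSON: {exc.msg}")
    if not isinstance(data, dict):
        return Verdict(False, 0.0, "top-level JSON value is not an object")
    # Only top-level keys are compared; real JSON-path support would go deeper.
    wrong = [k for k, v in expected.items() if k not in data or data[k] != v]
    ok = not wrong
    return Verdict(ok, float(ok), "all keys match" if ok else f"mismatched keys: {wrong}")


print(judge_exact("approximately 390", "391").passed)                    # False
print(judge_contains("Bonjour, le monde", required=["Bonjour"]).passed)  # True
print(judge_json('{"full_name": "Alice Chen", "age": 30}', {"age": 30}).passed)  # True
```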

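Want a custom judge? Write a class with a name and an evaluate method that returns a Verdict, then register an instance with the runner: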

from agent_eval_suite import Runner
from agent_eval_suite.judge import Verdict

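class CosineSimilarity:
    name = "cosine"

    def evaluate(self, response, expect):
        # compute_similarity is a placeholder for your own similarity function
        sim = compute_similarity(response, expect.metadata["target"])
        # Verdict arguments: pass/fail flag, score, human-readable reason
        return Verdict(sim > 0.8, sim, f"similarity={sim:.3f}")

runner = Runner(my_agent)  # my_agent: your async agent function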
runner.register_judge(CosineSimilarity())
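
The judge above calls `compute_similarity`, which the README leaves to the reader. One simple stand-in is cosine similarity over word counts, sketched below using only the standard library. This is an illustrative assumption: a real setup would more likely compare embedding vectors, and nothing here is provided by aes.

```python
import math
import re
from collections import Counter


def compute_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings, in [0, 1]."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(count * vb[word] for word, count in va.items())
    sq_a = sum(c * c for c in va.values())
    sq_b = sum(c * c for c in vb.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0  # empty input counts as no similarity


print(compute_similarity("the cat sat", "The cat sat"))  # 1.0
print(compute_similarity("the cat sat", "a dog ran"))    # 0.0
```

With the 0.8 threshold used in the judge, "the cat sat" against "the cat ran" scores about 0.667 and would fail.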

Deterministic Replay

Capture a real run once, then replay it offline:

from agent_eval_suite.replay import Recorder

rec = Recorder("recording.json")

async def recording_agent(query):
    out = await real_agent(query)
    rec.record(query, out)
    return out

# ... run once with the real agent ...
rec.save()

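# in CI, replay:
import asyncio
from agent_eval_suite import Runner
from agent_eval_suite.replay import Recorder

replayer = Recorder.load("recording.json")  # returns a Replayer
# `cases` is assumed to hold the cases already loaded from your YAML file
asyncio.run(Runner(replayer).run_all(cases))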

This is the difference between "we run our LLM against 200 cases on every PR and burn through API credits" and "we run our recorded outputs and only call the API when intentionally regenerating the recording."
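
For intuition, record and replay can be reduced to a JSON file that maps each input to the output the real agent produced. The sketch below is not the aes implementation. `SketchRecorder` and `SketchReplayer` are hypothetical classes, and keying the recording on the exact input string is an assumption.

```python
import asyncio
import json
import os
import tempfile


class SketchRecorder:
    """Hypothetical recorder: remembers agent outputs keyed by input."""

    def __init__(self, path):
        self.path = path
        self.outputs = {}

    def record(self, query, output):
        self.outputs[query] = output

    def save(self):
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(self.outputs, fh, ensure_ascii=False, indent=2)


class SketchReplayer:
    """Hypothetical replayer: an async callable that answers from a recording."""

    def __init__(self, path):
        with open(path, encoding="utf-8") as fh:
            self.outputs = json.load(fh)

    async def __call__(self, query):
        if query not in self.outputs:
            # Fail loudly instead of silently falling back to a live API call.
            raise KeyError(f"no recorded output for {query!r}")
        return self.outputs[query]


path = os.path.join(tempfile.mkdtemp(), "recording.json")
rec = SketchRecorder(path)
rec.record("Say hello in French", "Bonjour !")
rec.save()

replayer = SketchReplayer(path)
print(asyncio.run(replayer("Say hello in French")))  # Bonjour !
```

Because the replayer has the same async call shape as an agent, it can be passed anywhere an agent function is expected. Raising on an unrecorded input makes a stale recording visible as soon as a new case is added.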

Tracking Regressions

from agent_eval_suite.store import ResultStore

store = ResultStore("history.sqlite")
# `results` is the output of a completed run
store.save(run_id="pr-447", results=results)

# in nightly job
regressions = store.regressions(run_id="nightly-2026-04-30",
                                baseline_run_id="nightly-2026-04-29")
if regressions:
    print(f"NEW FAILURES: {regressions}")
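
A regression here means a case that passed in the baseline run and fails in the current one. The sketch below shows how that comparison could be expressed with the standard `sqlite3` module. The table layout and the `save` and `regressions` helpers are assumptions for illustration and do not reflect the actual ResultStore schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # an on-disk path would persist history across runs
conn.execute(
    "CREATE TABLE results ("
    " run_id TEXT, case_id TEXT, passed INTEGER,"
    " PRIMARY KEY (run_id, case_id))"
)


def save(run_id, results):
    """Store a {case_id: passed} mapping for one run."""
    conn.executemany(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
        [(run_id, case_id, int(passed)) for case_id, passed in results.items()],
    )
    conn.commit()


def regressions(run_id, baseline_run_id):
    """Case ids that passed in the baseline run but fail in the given run."""
    rows = conn.execute(
        """
        SELECT cur.case_id
        FROM results AS cur
        JOIN results AS base ON base.case_id = cur.case_id
        WHERE cur.run_id = ? AND base.run_id = ?
          AND base.passed = 1 AND cur.passed = 0
        ORDER BY cur.case_id
        """,
        (run_id, baseline_run_id),
    ).fetchall()
    return [row[0] for row in rows]


save("nightly-2026-04-29", {"greet-001": True, "math-001": True, "json-001": False})
save("nightly-2026-04-30", {"greet-001": True, "math-001": False, "json-001": False})
print(regressions("nightly-2026-04-30", "nightly-2026-04-29"))  # ['math-001']
```

Note that `json-001` fails in both runs, so it is not reported: only newly failing cases count as regressions.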

CI integration

# .github/workflows/agent-eval.yml
- run: pip install agent-eval-suite
- run: aes run cases/regression.yaml --agent app.agent:run --output report.json
- uses: actions/upload-artifact@v4
  if: always()  # upload the report even when aes exits non-zero
  with:
    name: eval-report
    path: report.json

Citation

If aes is useful in your work, you can cite it with this BibTeX entry:

@misc{agent-eval-suite,
  author = {fragres},
  title  = {agent-eval-suite: regression testing for LLM agents},
  year   = {2025},
  url    = {https://github.com/fragres/agent-eval-suite}
}

License

MIT — see LICENSE.
