fragres/agent-eval-suite

agent-eval-suite

A regression-testing harness for LLM agents — case files, judges, replay, and a result store that survives across runs.

Overview

Building agents is fast. Knowing whether your latest prompt change broke something is slow. agent-eval-suite (aes) is a small framework that lets you treat agent behavior the way you treat any other software: with versioned test cases, deterministic replays, and a CI-friendly pass/fail exit code.

It is intentionally small (~1800 LoC) and provider-agnostic. You bring an async function agent(input) -> response; aes brings the harness.

What this is

  • YAML-defined cases with substring, exact-match, JSON-path, and pluggable LLM-judge expectations.
  • Async runner with bounded concurrency and per-case timeouts.
  • Recorder/Replayer so you can capture a real agent run once and replay it offline — no API costs, no flakiness in CI.
  • SQLite result store for tracking pass-rate over time and diffing two runs to surface regressions.
  • CLI (aes run) that exits non-zero when any case fails — drop it into GitHub Actions and you have agent CI.
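The bounded concurrency and per-case timeouts mentioned in the list above can be sketched with plain asyncio. This is a minimal illustration, not aes's actual runner; `run_bounded`, `concurrency`, and `timeout_s` are hypothetical names chosen for the example.

```python
import asyncio


async def run_bounded(agent, inputs, concurrency=4, timeout_s=30.0):
    """Call `agent` on every input with at most `concurrency` calls in flight.

    Hypothetical helper, not part of aes. Returns one
    (input, response_or_None, error_or_None) tuple per input, in input order.
    """
    sem = asyncio.Semaphore(concurrency)

    async def run_one(item):
        async with sem:  # blocks while `concurrency` calls are already running
            try:
                # wait_for cancels the agent call once the timeout elapses
                response = await asyncio.wait_for(agent(item), timeout=timeout_s)
                return (item, response, None)
            except asyncio.TimeoutError:
                return (item, None, "timeout")
            except Exception as exc:  # one failing case should not abort the run
                return (item, None, repr(exc))

    # gather preserves the order of its arguments
    return await asyncio.gather(*(run_one(i) for i in inputs))
```

A semaphore keeps the number of concurrent agent calls fixed regardless of how many cases are queued, which is usually what you want when the agent sits behind a rate-limited API.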

What this is not

  • Not a benchmark leaderboard. There are plenty of those.
  • Not an evaluation dataset. You write the cases; aes only runs them.
  • Not a UI. Use the JSON output, aes diff, or pipe into your own dashboard.

Architecture

┌──────────────────┐
│  cases.yaml      │  ── declarative cases + expectations
└────────┬─────────┘
         │  loader
         ▼
┌──────────────────┐         ┌──────────────────┐
│   Runner         │◀────────│   Your agent fn  │
│ (async, bounded) │         │ async (input)→r  │
└────────┬─────────┘         └──────────────────┘
         │  per-case Verdict from a Judge
         ▼
┌──────────────────┐         ┌──────────────────┐
│   EvalReport     │────────▶│  ResultStore     │
│  (rich/json)     │         │  (sqlite)        │
└──────────────────┘         └──────────────────┘

Installation

pip install agent-eval-suite          # core
pip install "agent-eval-suite[sql]"   # SQLAlchemy-backed store (optional)

Quick Start

1. Write some cases

# cases/smoke.yaml
cases:
  - id: greet-001
    input: "Say hello in French"
    expect:
      contains: ["bonjour"]
    tags: [smoke, i18n]

  - id: math-001
    input: "What is 17 * 23?"
    expect:
      equals: "391"
    tags: [smoke, math]

  - id: json-001
    input: "Return JSON with the user's full name and age."
    expect:
      json_path:
        full_name: "Alice Chen"
        age: 30
    tags: [structured]

2. Define your agent

# my_agent.py
async def agent(query: str) -> str:
    # Your real agent — LLM call, tool loop, whatever
    return await run_my_agent(query)

3. Run

aes run cases/smoke.yaml --agent my_agent:agent --concurrency 4
┏━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Case ID   ┃ Pass ┃  Score ┃ Duration ┃ Reason                       ┃
┡━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ greet-001 │  ✓   │   1.00 │    342ms │ all substrings present       │
│ math-001  │  ✗   │   0.00 │    188ms │ got "approximately 390"      │
│ json-001  │  ✓   │   1.00 │    412ms │ all json paths match         │
└───────────┴──────┴────────┴──────────┴──────────────────────────────┘
2/3 passed (66.7%)

Exit code is 0 when everything passes, 1 otherwise.

Judges

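| Judge | Selected automatically when | Verdict |
| --- | --- | --- |
| exact | expect.equals is set | string equality (==) |
| contains | expect.contains or expect.not_contains is set | every contains substring present, no not_contains substring present |
| jsonpath | expect.json_path is set | parsed JSON values equal the expected values |
| llm | expect.judge: llm is set | caller-supplied |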
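To make the three deterministic checks concrete, here is a rough sketch of what each one might do. The function names are hypothetical and this is not aes's source. It also assumes case-sensitive matching and top-level JSON keys only, neither of which the README specifies.

```python
import json


def judge_exact(response, expected):
    """Pass only when the response equals the expected string exactly."""
    return response == expected


def judge_contains(response, contains=(), not_contains=()):
    """Pass when every `contains` item is present and no `not_contains` item is."""
    return (all(s in response for s in contains)
            and not any(s in response for s in not_contains))


def judge_json_path(response, expected):
    """Pass when the response parses as a JSON object whose keys match `expected`.

    Assumption: only top-level keys are compared; nested paths are out of scope here.
    """
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(k in data and data[k] == v for k, v in expected.items())
```

Under these assumptions, an exact-match case is strict: a response such as "approximately 390" fails an `equals: "391"` expectation even though it is numerically close.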

Want a custom judge?

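from agent_eval_suite import Runner
from agent_eval_suite.judge import Verdict

class CosineSimilarity:
    name = "cosine"

    def evaluate(self, response, expect):
        # compute_similarity is your own function returning a float; aes does not provide it
        sim = compute_similarity(response, expect.metadata["target"])
        # pass when similarity exceeds 0.8; the score and reason appear in the report
        return Verdict(sim > 0.8, sim, f"similarity={sim:.3f}")

runner = Runner(my_agent)  # my_agent: your async agent function
runner.register_judge(CosineSimilarity())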
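The custom-judge snippet calls `compute_similarity` without defining it. One possible stand-in, shown here purely as an assumption, is a bag-of-words cosine similarity built on the standard library; a real judge would more likely compare embedding vectors.

```python
import math
from collections import Counter


def compute_similarity(a: str, b: str) -> float:
    """Cosine similarity between lowercase word-count vectors of two strings.

    Illustrative stand-in only: returns a value in [0, 1], and 0.0 when
    either string has no words.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # dot product over the words the two strings share
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Word-count cosine rewards shared vocabulary rather than shared meaning, so the 0.8 threshold in the judge would need retuning for whichever similarity function you actually use.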

Deterministic Replay

Capture a real run once, then replay it offline:

from agent_eval_suite.replay import Recorder

rec = Recorder("recording.json")

async def recording_agent(query):
    out = await real_agent(query)  # real_agent: your live agent function
    rec.record(query, out)
    return out

# ... run once with recording_agent as the agent ...
rec.save()

# in CI, replay:
import asyncio

from agent_eval_suite import Runner
from agent_eval_suite.replay import Recorder

replayer = Recorder.load("recording.json")  # returns a Replayer
asyncio.run(Runner(replayer).run_all(cases))  # cases: your loaded test cases

This is the difference between "we run our LLM against 200 cases on every PR and burn through API credits" and "we run our recorded outputs and only call the API when intentionally regenerating the recording."
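The idea behind record and replay can be sketched in a few lines: store each output keyed by its input, write the mapping to JSON, and later serve answers from that file through an object that is callable like an async agent. The classes below are hypothetical illustrations, not the library's Recorder or Replayer, and the on-disk format is an assumption.

```python
import json


class SketchRecorder:
    """Collects input -> output pairs and writes them to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.outputs = {}

    def record(self, query, output):
        self.outputs[query] = output

    def save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.outputs, f, ensure_ascii=False, indent=2)


class SketchReplayer:
    """Async-callable stand-in for the agent that answers from a recording."""

    def __init__(self, outputs):
        self.outputs = outputs

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))

    async def __call__(self, query):
        try:
            return self.outputs[query]
        except KeyError:
            # fail loudly rather than silently calling a live API in CI
            raise KeyError(f"no recorded output for input: {query!r}") from None
```

Raising on an unrecorded input is a deliberate choice in this sketch: a new or edited case then fails visibly until the recording is regenerated.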

Tracking Regressions

from agent_eval_suite.store import ResultStore

store = ResultStore("history.sqlite")
store.save(run_id="pr-447", results=results)

# in nightly job
regressions = store.regressions(run_id="nightly-2026-04-30",
                                baseline_run_id="nightly-2026-04-29")
if regressions:
    print(f"NEW FAILURES: {regressions}")
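A regression in this sense is a case that passed in the baseline run and fails in the current one. The sketch below shows one way to compute that with the standard-library sqlite3 module. The table layout and function names are assumptions made for illustration and are not ResultStore's actual schema or API.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a results database with one row per (run, case)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               run_id  TEXT NOT NULL,
               case_id TEXT NOT NULL,
               passed  INTEGER NOT NULL,
               PRIMARY KEY (run_id, case_id)
           )"""
    )
    return conn


def save_run(conn, run_id, results):
    """Store a run; `results` maps case_id -> bool (passed)."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO results (run_id, case_id, passed) VALUES (?, ?, ?)",
            [(run_id, case_id, int(ok)) for case_id, ok in results.items()],
        )


def find_regressions(conn, run_id, baseline_run_id):
    """Return case ids that passed in the baseline but fail in `run_id`."""
    rows = conn.execute(
        """SELECT cur.case_id
             FROM results AS cur
             JOIN results AS base ON base.case_id = cur.case_id
            WHERE cur.run_id = ? AND base.run_id = ?
              AND base.passed = 1 AND cur.passed = 0
            ORDER BY cur.case_id""",
        (run_id, baseline_run_id),
    ).fetchall()
    return [row[0] for row in rows]
```

In this sketch, a case that fails in the current run but did not exist in the baseline is not reported, because the join requires a baseline row. The README does not say how aes treats such cases.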

CI Integration

# .github/workflows/agent-eval.yml
- run: pip install agent-eval-suite
- run: aes run cases/regression.yaml --agent app.agent:run --output report.json
- uses: actions/upload-artifact@v4
  if: always()  # upload the report even when aes run exits non-zero
  with:
    name: eval-report
    path: report.json

Citation

If aes is useful in your work, you can cite it with this BibTeX entry:

@misc{agent-eval-suite,
  author = {fragres},
  title  = {agent-eval-suite: regression testing for LLM agents},
  year   = {2025},
  url    = {https://github.com/fragres/agent-eval-suite}
}

License

MIT — see LICENSE.
