Skip to content

feat(eval): outcome-eval harness — measures impact, not just correctness#2

Closed
mikelninh wants to merge 1 commit into
mainfrom
feat/eval-harness
Closed

feat(eval): outcome-eval harness — measures impact, not just correctness#2
mikelninh wants to merge 1 commit into
mainfrom
feat/eval-harness

Conversation

@mikelninh
Copy link
Copy Markdown
Owner

The existing test suite proves correctness. This adds the missing layer: does the MCP measurably change the answers an LLM gives a citizen?

First committed run

Metric Baseline Treatment (+GitLaw MCP)
Hallucination rate 5.9% 0.0%
Expected hit rate 62.5% 62.5%
Citations per question 2.12 1.46
Mean tool calls (treatment) 1.25

The headline: every cited § the treatment model produces is verified against the corpus. Zero hallucinations. The 5.9% baseline figure includes one question (gg-03 — demo permit) where the model invented three fake paragraphs — exactly the scenario this MCP exists to prevent.

Honest non-headline: hit rate is identical. The treatment model becomes more conservative (1.46 § vs 2.12 citations per answer) because it only emits verified statutes. On well-known questions in our current set, gpt-4o-mini already gets the canonical § right. Widening the gap needs harder long-tail questions where the baseline model is more likely to invent — on the roadmap.

What's in the PR

  • questions.json — 25 hand-labelled real Lebenslagen questions w/ expected_paragraphs
  • run.py — harness with --model / --limit flags, OpenAI function-calling to dispatch GitLaw tools
  • README.md — how to run, how to read the numbers, honest known limits
  • eval_summary.md — committed as the public record of the latest run
  • .gitignore — keeps timestamped per-run dumps out of git

Why this matters

Reproducible, verifiable, public. Anyone can clone, run, get the same numbers. That's the foundation for any claim about "this MCP makes LLMs more truthful."

🤖 Generated with Claude Code

…s. without MCP

The existing tests in gitlaw_mcp/tests/ prove correctness (the tools return the
right answer when asked correctly). They don't prove *impact* — does giving an
LLM these tools actually change how it answers a citizen's legal question?

This eval harness answers exactly that. 25 hand-labelled real Lebenslagen
questions, run twice through gpt-4o-mini:

  BASELINE   — no tools, answers from training-only knowledge
  TREATMENT  — same prompt, GitLaw tools available via OpenAI function-calling
               (functionally equivalent to how an MCP client exposes them)

Headline result on the first committed run:
  hallucination rate:  5.9% → 0.0%   (every cited § now verified against corpus)
  expected hit rate:   62.5% → 62.5%   (no change — see below for honest read)
  mean tool calls per question (treatment): 1.25

The hallucination story is real and reproducible. The hit-rate stability is
the honest part: gpt-4o-mini already knows the well-known statutes in our
question set; the treatment becomes more conservative (cites 1.46 § vs 2.12
in baseline) because it only emits verified citations. The diagnostic info
in eval_summary.md per-question table shows exactly which questions need
better prompting in treatment and which need harder long-tail entries to
widen the gap.

Files:
  questions.json  — 25 hand-labelled questions w/ expected_paragraphs
  run.py          — eval harness with --model / --limit flags
  README.md       — how to run, how to read, honest limits
  eval_summary.md — latest run committed as public record (regenerated each run)
  .gitignore      — keeps timestamped per-run JSON dumps out of git history

Roadmap on the README: harder long-tail questions, multi-model comparison
(gpt-3.5-turbo / gpt-4o-mini / gpt-4o), citation-extraction improvements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown

vercel Bot commented May 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
gitlaw Ready Ready Preview, Comment May 28, 2026 4:59pm

Request Review

@mikelninh
Copy link
Copy Markdown
Owner Author

Superseded by #3 — that PR contains the eval harness plus the freshness scaffold + Phase-1a live sync. Closing this in favour of the broader one.

@mikelninh mikelninh closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants