feat(eval): outcome-eval harness — measures impact, not just correctness by mikelninh · Pull Request #2 · mikelninh/gitlaw

mikelninh · 2026-05-28T16:57:58Z

The existing test suite proves correctness. This adds the missing layer: does the MCP measurably change the answers an LLM gives a citizen?

First committed run

Metric	Baseline	Treatment (+GitLaw MCP)
Hallucination rate	5.9%	0.0%
Expected hit rate	62.5%	62.5%
Citations per question	2.12	1.46
Mean tool calls (treatment)	—	1.25

The headline: every cited § the treatment model produces is verified against the corpus. Zero hallucinations. The 5.9% baseline figure includes one question (gg-03 — demo permit) where the model invented three fake paragraphs — exactly the scenario this MCP exists to prevent.

Honest non-headline: hit rate is identical. The treatment model becomes more conservative (1.46 vs 2.12 citations per answer) because it only emits verified statutes. On the well-known questions in our current set, gpt-4o-mini already gets the canonical § right. Widening the gap needs harder long-tail questions where the baseline model is more likely to invent citations; those are on the roadmap.
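To make the three table metrics concrete, here is a minimal sketch of how they could be computed. The definitions are assumptions inferred from the description above (hallucination rate as the share of cited paragraphs missing from the corpus, hit rate as the share of questions citing at least one expected paragraph); the actual formulas in run.py may differ. The `Answer` type, the function names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    cited: list[str]     # citations extracted from the model's answer
    expected: list[str]  # hand-labelled expected_paragraphs for the question


def hallucination_rate(answers, corpus):
    """Share of all cited paragraphs that are absent from the corpus (assumed definition)."""
    cited = [c for a in answers for c in a.cited]
    if not cited:
        return 0.0
    return sum(c not in corpus for c in cited) / len(cited)


def expected_hit_rate(answers):
    """Share of questions whose answer cites at least one expected paragraph (assumed definition)."""
    if not answers:
        return 0.0
    hits = sum(bool(set(a.cited) & set(a.expected)) for a in answers)
    return hits / len(answers)


def citations_per_question(answers):
    """Mean number of citations per answer."""
    if not answers:
        return 0.0
    return sum(len(a.cited) for a in answers) / len(answers)


# Toy data for illustration only; "VersG § 99" is deliberately not in the toy corpus.
corpus = {"GG Art. 8", "VersG § 14", "BGB § 573"}
answers = [
    Answer(cited=["GG Art. 8", "VersG § 99"], expected=["VersG § 14"]),
    Answer(cited=["BGB § 573"], expected=["BGB § 573"]),
]

print(f"hallucination rate: {hallucination_rate(answers, corpus):.1%}")
print(f"expected hit rate:  {expected_hit_rate(answers):.1%}")
print(f"citations/question: {citations_per_question(answers):.2f}")
```

Under these definitions a model can lower its hallucination rate simply by citing less, which is why the hit rate and citations-per-question columns are worth reading alongside it.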

What's in the PR

questions.json — 25 hand-labelled real Lebenslagen questions w/ expected_paragraphs
run.py — harness with --model / --limit flags, OpenAI function-calling to dispatch GitLaw tools
README.md — how to run, how to read the numbers, honest known limits
eval_summary.md — committed as the public record of the latest run
.gitignore — keeps timestamped per-run dumps out of git
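As a rough illustration of how the files above could fit together, the sketch below parses the two flags named in the list and loads a question set. Only `--model`, `--limit`, and the `expected_paragraphs` field come from the PR text. The `id` and `question` field names, the default model, the helper names, and the sample entries are assumptions and are not taken from the actual questions.json or run.py.

```python
import argparse
import json

# Hypothetical sample in the assumed shape of questions.json.
SAMPLE_QUESTIONS_JSON = """
[
  {"id": "demo-01", "question": "Do I have to register a demonstration?",
   "expected_paragraphs": ["VersG § 14"]},
  {"id": "demo-02", "question": "Can my landlord terminate my lease?",
   "expected_paragraphs": ["BGB § 573"]}
]
"""

REQUIRED_KEYS = {"id", "question", "expected_paragraphs"}


def build_parser():
    """CLI with the two flags mentioned in the PR; the defaults are assumptions."""
    parser = argparse.ArgumentParser(description="Outcome-eval harness (sketch)")
    parser.add_argument("--model", default="gpt-4o-mini", help="model to evaluate")
    parser.add_argument("--limit", type=int, default=None,
                        help="only run the first N questions")
    return parser


def load_questions(raw_json, limit=None):
    """Parse the question list, check required keys, and apply --limit."""
    questions = json.loads(raw_json)
    for q in questions:
        missing = REQUIRED_KEYS - q.keys()
        if missing:
            raise ValueError(f"question {q.get('id', '?')} is missing {sorted(missing)}")
    # A slice with limit=None returns the full list.
    return questions[:limit]


if __name__ == "__main__":
    args = build_parser().parse_args()
    for q in load_questions(SAMPLE_QUESTIONS_JSON, args.limit):
        print(args.model, q["id"], q["expected_paragraphs"])
```

Validating the labels up front keeps a malformed entry from silently skewing the hit rate later in the run.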

Why this matters

Reproducible, verifiable, public. Anyone can clone, run, get the same numbers. That's the foundation for any claim about "this MCP makes LLMs more truthful."

🤖 Generated with Claude Code

…s. without MCP

The existing tests in gitlaw_mcp/tests/ prove correctness (the tools return the right answer when asked correctly). They don't prove *impact*: does giving an LLM these tools actually change how it answers a citizen's legal question?

This eval harness answers exactly that. 25 hand-labelled real Lebenslagen questions, run twice through gpt-4o-mini:

BASELINE: no tools, answers from training-only knowledge
TREATMENT: same prompt, GitLaw tools available via OpenAI function-calling (functionally equivalent to how an MCP client exposes them)

Headline result on the first committed run:

hallucination rate: 5.9% → 0.0% (every cited § now verified against corpus)
expected hit rate: 62.5% → 62.5% (no change, see below for honest read)
mean tool calls per question (treatment): 1.25

The hallucination story is real and reproducible. The hit-rate stability is the honest part: gpt-4o-mini already knows the well-known statutes in our question set; the treatment becomes more conservative (cites 1.46 § vs 2.12 in baseline) because it only emits verified citations. The diagnostic info in the eval_summary.md per-question table shows exactly which questions need better prompting in treatment and which need harder long-tail entries to widen the gap.

Files:

questions.json: 25 hand-labelled questions w/ expected_paragraphs
run.py: eval harness with --model / --limit flags
README.md: how to run, how to read, honest limits
eval_summary.md: latest run committed as public record (regenerated each run)
.gitignore: keeps timestamped per-run JSON dumps out of git history

Roadmap on the README: harder long-tail questions, multi-model comparison (gpt-3.5-turbo / gpt-4o-mini / gpt-4o), citation-extraction improvements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
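The treatment arm described in the commit message relies on OpenAI function-calling to reach the GitLaw tools. The sketch below shows one way the dispatch step could look, without making any network call: a tool schema in the Chat Completions `tools` format, and a function that turns a model-issued tool call into the `role: "tool"` message sent back to the model. The tool name `lookup_paragraph`, its parameter, and the toy corpus are hypothetical stand-ins; the real GitLaw tool names and signatures are not given in this PR.

```python
import json

# Toy corpus standing in for the real statute store (hypothetical content).
CORPUS = {"VersG § 14": "placeholder statute text"}


def lookup_paragraph(citation: str) -> dict:
    """Hypothetical tool: report whether a citation exists in the corpus."""
    text = CORPUS.get(citation)
    return {"citation": citation, "found": text is not None, "text": text}


# Schema in the shape expected by the Chat Completions `tools` parameter.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_paragraph",
        "description": "Look up a statute paragraph in the corpus.",
        "parameters": {
            "type": "object",
            "properties": {"citation": {"type": "string"}},
            "required": ["citation"],
        },
    },
}]

HANDLERS = {"lookup_paragraph": lookup_paragraph}


def dispatch_tool_call(call_id: str, name: str, arguments_json: str) -> dict:
    """Run one tool call and build the tool message to append to the chat."""
    handler = HANDLERS.get(name)
    if handler is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = handler(**json.loads(arguments_json))
        except (TypeError, ValueError) as exc:
            # Malformed JSON or wrong argument names are reported, not raised.
            result = {"error": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": call_id,
        "content": json.dumps(result, ensure_ascii=False),
    }


msg = dispatch_tool_call("call_1", "lookup_paragraph", '{"citation": "VersG § 14"}')
print(msg["content"])
```

In a full loop, the harness would pass `TOOLS` with the request, call `dispatch_tool_call` for each tool call in the response, append the returned messages, and ask the model again. Counting those dispatches per question would be one way to obtain a mean-tool-calls figure.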

vercel · 2026-05-28T16:58:00Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
gitlaw	Ready	Preview, Comment	May 28, 2026 4:59pm

mikelninh · 2026-05-28T18:23:46Z

Superseded by #3 — that PR contains the eval harness plus the freshness scaffold + Phase-1a live sync. Closing this in favour of the broader one.

vercel Bot deployed to Preview May 28, 2026 16:59

mikelninh closed this May 28, 2026

mikelninh wants to merge 1 commit into main from feat/eval-harness
