feat(eval): outcome-eval harness — measures impact, not just correctness#2
Closed
mikelninh wants to merge 1 commit into
…s. without MCP
The existing tests in gitlaw_mcp/tests/ prove correctness (the tools return the
right answer when asked correctly). They don't prove *impact* — does giving an
LLM these tools actually change how it answers a citizen's legal question?
This eval harness answers exactly that. 25 hand-labelled real Lebenslagen
(life-situation) questions, run twice through gpt-4o-mini:
BASELINE — no tools, answers from training-only knowledge
TREATMENT — same prompt, GitLaw tools available via OpenAI function-calling
(functionally equivalent to how an MCP client exposes them)
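As a rough illustration of the two arms described above, the sketch below shows how a tool can be described in the JSON-schema shape used by OpenAI function-calling and how a tool call might be dispatched to a local Python function. The tool name `get_paragraph`, its parameters, the toy corpus, and the helpers `dispatch` and `request_kwargs` are all hypothetical; the real GitLaw tools and the actual structure of run.py are not shown here.

```python
import json

# Hypothetical stand-in for one GitLaw tool; the real tool names and
# signatures live in gitlaw_mcp and are not reproduced here.
def get_paragraph(law: str, paragraph: str) -> dict:
    corpus = {("BGB", "823"): "Schadensersatzpflicht"}  # toy corpus, one entry
    title = corpus.get((law, paragraph))
    return {"law": law, "paragraph": paragraph,
            "found": title is not None, "title": title}

# Tool description in the shape OpenAI function-calling expects.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_paragraph",
        "description": "Look up a paragraph in the law corpus.",
        "parameters": {
            "type": "object",
            "properties": {
                "law": {"type": "string"},
                "paragraph": {"type": "string"},
            },
            "required": ["law", "paragraph"],
        },
    },
}]

REGISTRY = {"get_paragraph": get_paragraph}

def dispatch(name: str, arguments_json: str) -> str:
    """Run the requested tool and return a JSON string for the tool message."""
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(REGISTRY[name](**json.loads(arguments_json)))

def request_kwargs(arm: str, model: str, messages: list) -> dict:
    """Both arms share model and prompt; only the treatment arm gets tools."""
    kwargs = {"model": model, "messages": messages}
    if arm == "treatment":
        kwargs["tools"] = TOOLS
    return kwargs
```

Keeping the prompt and model identical and varying only the `tools` argument is what lets any difference between the two arms be attributed to tool access.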
Headline result on the first committed run:
hallucination rate: 5.9% → 0.0% (every cited § now verified against corpus)
expected hit rate: 62.5% → 62.5% (no change — see below for honest read)
mean tool calls per question (treatment): 1.25
The hallucination story is real and reproducible. The hit-rate stability is
the honest part: gpt-4o-mini already knows the well-known statutes in our
question set, and the treatment becomes more conservative (1.46 § cited per
answer vs 2.12 in baseline) because it only emits verified citations. The
per-question table in eval_summary.md shows exactly which questions need
better prompting in treatment and which need harder long-tail entries to
widen the gap.
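To make the three headline numbers concrete, here is a minimal sketch of how they could be computed. The metric definitions are assumptions, not taken from run.py: hallucination rate is read as the share of cited paragraphs missing from the corpus, and hit rate as the share of questions whose answer cites at least one expected paragraph. The `Answer` class and the toy data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    question_id: str
    cited: list      # paragraphs extracted from the model's answer
    expected: list   # hand-labelled expected_paragraphs

def hallucination_rate(answers, corpus):
    """Share of all cited paragraphs that do not exist in the corpus."""
    cited = [c for a in answers for c in a.cited]
    if not cited:
        return 0.0
    return sum(c not in corpus for c in cited) / len(cited)

def hit_rate(answers):
    """Share of questions whose answer cites at least one expected paragraph."""
    scored = [a for a in answers if a.expected]
    if not scored:
        return 0.0
    hits = sum(bool(set(a.cited) & set(a.expected)) for a in scored)
    return hits / len(scored)

def mean_citations(answers):
    """Average number of cited paragraphs per answer."""
    return sum(len(a.cited) for a in answers) / len(answers) if answers else 0.0

# Toy data, invented for illustration only (not results from the eval).
corpus = {"§ 823 BGB", "Art. 8 GG", "§ 535 BGB"}
answers = [
    Answer("q1", ["§ 823 BGB"], ["§ 823 BGB"]),
    Answer("q2", ["Art. 8 GG", "§ 999 XYZ"], ["Art. 8 GG"]),  # one invented citation
    Answer("q3", ["§ 535 BGB"], ["§ 536 BGB"]),               # verified but a miss
]
```

On this toy data one of four citations is unverified and two of three questions hit. It also shows why the two metrics can move independently: q3 cites a real paragraph and still misses the expected one.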
Files:
questions.json — 25 hand-labelled questions w/ expected_paragraphs
run.py — eval harness with --model / --limit flags
README.md — how to run, how to read, honest limits
eval_summary.md — latest run committed as public record (regenerated each run)
.gitignore — keeps timestamped per-run JSON dumps out of git history
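The `--model` and `--limit` flags listed above could be wired up roughly as follows. This is only a sketch: the defaults, help texts, and the `select_questions` helper are assumptions, and the real run.py may differ.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Outcome eval: baseline vs. treatment")
    # Default model is an assumption based on the model named in this PR.
    parser.add_argument("--model", default="gpt-4o-mini", help="model to evaluate")
    # Assumed semantics: run only the first N questions (handy for cheap smoke runs).
    parser.add_argument("--limit", type=int, default=None, help="run only the first N questions")
    return parser

def select_questions(questions: list, limit):
    """Return all questions, or only the first `limit` if a limit is given."""
    return questions if limit is None else questions[:limit]
```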
Roadmap on the README: harder long-tail questions, multi-model comparison
(gpt-3.5-turbo / gpt-4o-mini / gpt-4o), citation-extraction improvements.
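Since citation extraction is on the roadmap, here is a deliberately simple regex-based sketch of what that step might look like. It is not the extraction logic used by run.py, which is not shown here. The pattern and the `extract_citations` helper are hypothetical, and the pattern only handles plain forms such as "§ 823 Abs. 1 BGB" or "Art. 8 GG".

```python
import re

# Marker (§ or Art.), paragraph number, optional "Abs. N", then the law abbreviation.
CITATION = re.compile(r"(§|Art\.)\s*(\d+[a-z]?)(?:\s+Abs\.\s*\d+)?\s+([A-Z][A-Za-z]+)")

def extract_citations(text: str) -> list:
    """Return normalised, de-duplicated citations in order of first appearance."""
    seen, out = set(), []
    for marker, number, law in CITATION.findall(text):
        key = f"{marker} {number} {law}"
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out
```

A pattern like this will misread phrases such as "§ 1 Satz 2", where the word after the number is not a law abbreviation. Errors of that kind flow into the hallucination and hit-rate figures, which is one reason extraction quality matters for the eval.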
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Superseded by #3: that PR contains the eval harness plus the freshness scaffold and Phase-1a live sync. Closing this in favour of the broader one.
The existing test suite proves correctness. This adds the missing layer: does the MCP measurably change the answers an LLM gives a citizen?
First committed run
The headline: every cited § the treatment model produces is verified against the corpus. Zero hallucinations. The 5.9% baseline figure includes one question (gg-03, demo permit) where the model invented three fake paragraphs, exactly the scenario this MCP exists to prevent.

Honest non-headline: hit rate is identical. The treatment model becomes more conservative (1.46 vs 2.12 § citations per answer) because it only emits verified statutes. On the well-known questions in our current set, gpt-4o-mini already gets the canonical § right. Widening the gap needs harder long-tail questions where the baseline model is more likely to invent citations; that is on the roadmap.
What's in the PR
questions.json: 25 hand-labelled real Lebenslagen questions w/ expected_paragraphs
run.py: harness with --model / --limit flags, OpenAI function-calling to dispatch GitLaw tools
README.md: how to run, how to read the numbers, honest known limits
eval_summary.md: committed as the public record of the latest run
.gitignore: keeps timestamped per-run dumps out of git
Why this matters
Reproducible, verifiable, public. Anyone can clone, run, get the same numbers. That's the foundation for any claim about "this MCP makes LLMs more truthful."
🤖 Generated with Claude Code