3 changes: 3 additions & 0 deletions gitlaw_mcp/eval/.gitignore
@@ -0,0 +1,3 @@
# Per-run output reports — eval_summary.md is committed as the latest snapshot,
# but the timestamped per-run JSON dumps are not (they grow without bound).
eval_report_*.json
81 changes: 81 additions & 0 deletions gitlaw_mcp/eval/README.md
@@ -0,0 +1,81 @@
# GitLaw MCP — outcome eval

This directory is the **public, reproducible eval harness** for GitLaw MCP. It
measures the answer-quality difference between an LLM answering legal questions
*without* tools versus *with* the GitLaw MCP tools available.

The whole point: claims about anti-hallucination need data, not vibes. This is
the data.

---

## How to read the headline number

Each run produces two metrics that matter:

- **Hallucination rate** — fraction of cited paragraphs that don't exist in the
German Bundesrecht corpus. Lower is better. The MCP is designed to drive this
to zero, because every cited § goes through `verify_citation` before the model
uses it.
- **Expected-hit rate** — fraction of questions where the answer cited at least
one of the paragraphs a competent legal answer would mention. Higher is better.

A useful third number: **citations per answer**. Treatment is usually lower than
baseline because the model becomes more conservative (only cites what it
verified). That's by design — but worth watching, because over-conservatism can
cost hit-rate.
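
The three numbers above can be sketched in a few lines. This is a minimal illustration, not the harness's own code: the `ScoredAnswer` shape and the function names are assumptions made for the example. The hallucination rate is pooled over all citations, consistent with the committed summary (3 hallucinated out of 51 total gives 5.9%).

```python
from dataclasses import dataclass


@dataclass
class ScoredAnswer:
    """One answer reduced to what the metrics need (hypothetical shape)."""
    cited: list[str]      # citations extracted from the answer text
    verified: set[str]    # the cited paragraphs that were found in the corpus
    expected: list[str]   # expected_paragraphs for the question


def hallucination_rate(answers: list[ScoredAnswer]) -> float:
    """Unverifiable citations divided by all citations, across all answers."""
    total = sum(len(a.cited) for a in answers)
    bad = sum(1 for a in answers for c in a.cited if c not in a.verified)
    return bad / total if total else 0.0


def expected_hit_rate(answers: list[ScoredAnswer]) -> float:
    """Share of answers that cite at least one expected paragraph."""
    hits = sum(1 for a in answers if set(a.cited) & set(a.expected))
    return hits / len(answers) if answers else 0.0


def citations_per_answer(answers: list[ScoredAnswer]) -> float:
    """Mean number of citations per answer."""
    return sum(len(a.cited) for a in answers) / len(answers) if answers else 0.0


if __name__ == "__main__":
    demo = [
        # "§ 9999 BGB" stands in for an invented paragraph.
        ScoredAnswer(["§ 574 BGB", "§ 9999 BGB"], {"§ 574 BGB"}, ["§ 574 BGB"]),
        ScoredAnswer(["§ 535 BGB"], {"§ 535 BGB"}, ["§ 558 BGB"]),
    ]
    print(hallucination_rate(demo), expected_hit_rate(demo), citations_per_answer(demo))
```

In the demo data, one of three citations is invented and one of two answers cites its expected paragraph, so a verified but off-target citation (the second answer) lowers the hit rate without raising the hallucination rate.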

## Run it yourself

```bash
cd /path/to/gitlaw
source .env.local # OPENAI_API_KEY
python -m gitlaw_mcp.eval.run --limit 5 # cheap smoke (~30s, ~$0.005)
python -m gitlaw_mcp.eval.run # full 24 questions (~2 min, ~$0.05)
python -m gitlaw_mcp.eval.run --model gpt-4o # bigger model
```

Output:
- `eval_report_<utc-timestamp>.json` — full per-question detail (input, both
answers, every citation, verification result for each)
- `eval_summary.md` — the markdown summary that gets committed to the repo

## Question set (`questions.json`)

24 hand-labelled questions across Miete, Arbeit, Strafrecht, Erbrecht,
Familie, Grundgesetz, Zivilrecht, Datenschutz. Each comes with
`expected_paragraphs` — the canonical citation(s) we hand-verified against
gesetze-im-internet.de.

The set is intentionally biased toward questions from **realistic Lebenslagen**
(everyday life situations) that a citizen, tenant, employee, or harassment
victim would actually ask, not law-school exam questions. Adding harder
long-tail questions (less-common statutes where the baseline model is more
likely to invent citations) is on the roadmap; we expect those to widen the
gap further.
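
Before adding questions, it can help to sanity-check the file. The sketch below assumes only the layout visible in `questions.json` (a top-level `questions` list whose entries carry `id`, `category`, and `expected_paragraphs`); the helper names are hypothetical and not part of the harness.

```python
import json
from collections import Counter
from pathlib import Path


def load_questions(path):
    """Read the question set and return the list under the "questions" key."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data["questions"]


def validate(questions):
    """Return a list of problems: duplicate ids or empty expected_paragraphs."""
    problems = []
    seen = set()
    for q in questions:
        if q["id"] in seen:
            problems.append(f"duplicate id: {q['id']}")
        seen.add(q["id"])
        if not q.get("expected_paragraphs"):
            problems.append(f"no expected_paragraphs: {q['id']}")
    return problems


def per_category(questions):
    """Count questions per category, e.g. to spot thin categories."""
    return Counter(q["category"] for q in questions)


if __name__ == "__main__":
    # Path relative to the repository root.
    qs = load_questions("gitlaw_mcp/eval/questions.json")
    print(len(qs), "questions")
    print(dict(per_category(qs)))
    print(validate(qs) or "no problems found")
```

Run from the repository root; an empty problem list means every question has a unique id and at least one expected paragraph.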

## Latest run (committed)

See [`eval_summary.md`](./eval_summary.md). It's regenerated on every run and
the most recent committed version is the public record. Past runs sit in git
history.

## What the eval cannot show

Honest limits:

- **One language only (German).** A multilingual eval would need a multilingual
question set + ground truth in each language.
- **One model class.** We test `gpt-4o-mini` by default. We expect the gap to
widen with weaker models (e.g. `gpt-3.5-turbo`) and to narrow with stronger
ones (`gpt-4o`, Claude Opus). The `--model` flag lets you check.
- **Hit-rate is binary per question.** We don't yet score "partial hit"
(cited a related but adjacent §).
- **Citation extraction is regex-based.** Models sometimes phrase citations
in ways our regex misses — that under-counts citations for both conditions
equally, but distorts absolute hit-rate downward.

These are known. Patches welcome.
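
To make the regex limitation concrete, here is a minimal sketch of regex-based citation extraction. The pattern is an illustrative assumption, not the harness's actual regex: it recognises `§ N Law` and `Art. N Law`, drops `Abs.`/`Satz`/`Nr.` qualifiers, and requires the law abbreviation to end in an uppercase letter (BGB, KSchG, GG).

```python
import re

# Hypothetical pattern for illustration only.
CITATION_RE = re.compile(
    r"(?P<kind>§|Art\.)\s*"                    # paragraph sign or "Art."
    r"(?P<num>\d+[a-z]?)"                      # number, optional letter suffix (573c)
    r"(?:\s+(?:Abs\.|Satz|Nr\.)\s*\d+)*"       # optional qualifiers, ignored
    r"\s+(?P<law>[A-ZÄÖÜ][A-Za-zÄÖÜäöü]*[A-ZÄÖÜ])\b"  # law abbreviation
)


def extract_citations(text: str) -> list[str]:
    """Return normalised citations in order of first appearance, without duplicates."""
    found = (
        f"{m.group('kind')} {m.group('num')} {m.group('law')}"
        for m in CITATION_RE.finditer(text)
    )
    return list(dict.fromkeys(found))


if __name__ == "__main__":
    answer = (
        "Nach § 574 BGB und § 573c Abs. 1 BGB sowie Art. 5 GG. "
        "Siehe auch Paragraph 823 des BGB."
    )
    print(extract_citations(answer))
```

The spelled-out form "Paragraph 823 des BGB" in the sample is not matched, which is exactly the kind of miss that under-counts citations. Normalising to `§ N Law` also means a citation with an `Abs.` qualifier still compares equal to the plain entry in `expected_paragraphs`.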

## License

Same as the rest of GitLaw MCP — MIT.
1 change: 1 addition & 0 deletions gitlaw_mcp/eval/__init__.py
@@ -0,0 +1 @@
"""Outcome eval harness — does GitLaw MCP measurably reduce hallucinations?"""
48 changes: 48 additions & 0 deletions gitlaw_mcp/eval/eval_summary.md
@@ -0,0 +1,48 @@
# GitLaw MCP — outcome eval

_Run at 2026-05-28T16:55:23+00:00 · model `gpt-4o-mini` · 24 questions · 869.5s_

## Headline

- **Hallucination rate**: `5.9%` → `0.0%` (down 5.9 percentage points)
- **Expected-citation hit rate**: `62.5%` → `62.5%` (unchanged)
- Mean tool calls per question (treatment): **1.25**

## Per-condition stats

| Metric | Baseline | Treatment (+GitLaw MCP) |
|---|---:|---:|
| Hallucination rate | 5.9% | 0.0% |
| Expected hit rate | 62.5% | 62.5% |
| Citations per question | 2.12 | 1.46 |
| Total citations | 51 | 35 |
| Hallucinated citations | 3 | 0 |

## Per-question results

| ID | Category | Question | Baseline hit | Treatment hit | Hallucinated citations (baseline → treatment) |
|---|---|---|:-:|:-:|:-:|
| miete-01 | Miete | Mein Vermieter kündigt mir wegen Eigenbedarf. Kann ich … | ✗ | ✓ | 0 → 0 |
| miete-02 | Miete | Wie hoch darf mein Vermieter die Miete erhöhen? | ✓ | ✓ | 0 → 0 |
| miete-03 | Miete | Ich habe Schimmel in der Wohnung. Darf ich die Miete kü… | ✓ | ✓ | 0 → 0 |
| miete-04 | Miete | Wie lange ist die Kündigungsfrist wenn ich als Mieter k… | ✗ | ✓ | 0 → 0 |
| miete-05 | Miete | Mein Vermieter zahlt mir die Kaution nicht zurück. Welc… | ✗ | ✓ | 0 → 0 |
| arbeit-01 | Arbeit | Mein Chef will mir kündigen. Welche Rechte habe ich bei… | ✓ | ✗ | 0 → 0 |
| arbeit-02 | Arbeit | Wieviele Urlaubstage stehen mir pro Jahr mindestens zu? | ✓ | ✓ | 0 → 0 |
| arbeit-03 | Arbeit | Was darf mein Arbeitgeber zur maximalen täglichen Arbei… | ✓ | ✓ | 0 → 0 |
| arbeit-04 | Arbeit | Was sind die Aufgaben und Rechte des Betriebsrats? | ✗ | ✗ | 0 → 0 |
| straf-01 | Strafrecht | Mein Ex bedroht mich auf WhatsApp damit, mein Wohnort z… | ✓ | ✓ | 0 → 0 |
| straf-02 | Strafrecht | Was ist Stalking strafrechtlich in Deutschland? | ✓ | ✓ | 0 → 0 |
| straf-03 | Strafrecht | Jemand hat mich öffentlich auf Instagram beleidigt. Wel… | ✓ | ✓ | 0 → 0 |
| straf-04 | Strafrecht | Was ist die rechtliche Definition von Betrug? | ✓ | ✗ | 0 → 0 |
| straf-05 | Strafrecht | Jemand greift mich körperlich an. Wann ist Notwehr erla… | ✓ | ✓ | 0 → 0 |
| straf-06 | Strafrecht | Was ist Nötigung im deutschen Strafrecht? | ✓ | ✗ | 0 → 0 |
| erbe-01 | Erbrecht | Mein Vater ist gestorben ohne Testament. Wer erbt? | ✗ | ✗ | 0 → 0 |
| erbe-02 | Erbrecht | Wie schreibe ich rechtsgültig ein handschriftliches Tes… | ✓ | ✓ | 0 → 0 |
| fam-01 | Familie | Ab wann habe ich Anspruch auf Elternzeit? | ✓ | ✓ | 0 → 0 |
| gg-01 | Grundgesetz | Habe ich das Recht meine politische Meinung öffentlich … | ✗ | ✗ | 0 → 0 |
| gg-02 | Grundgesetz | Ist die Menschenwürde in Deutschland antastbar? | ✗ | ✓ | 0 → 0 |
| gg-03 | Grundgesetz | Brauche ich für eine politische Demonstration eine Gene… | ✗ | ✗ | 3 → 0 |
| haft-01 | Zivilrecht | Mein Nachbar hat mein Auto beschädigt. Auf welcher Grun… | ✓ | ✗ | 0 → 0 |
| haft-02 | Zivilrecht | Mein Kaufvertrag wurde nicht erfüllt. Welcher Paragraph… | ✓ | ✓ | 0 → 0 |
| data-01 | Datenschutz | Was sind die Grundbegriffe des deutschen Datenschutzes … | ✗ | ✗ | 0 → 0 |
36 changes: 36 additions & 0 deletions gitlaw_mcp/eval/questions.json
@@ -0,0 +1,36 @@
{
"_doc": "Outcome-eval question set. Each question is a real Lebenslage a Berlin citizen, tenant, employee, or victim might ask. expected_paragraphs is the canonical citation(s) any competent answer should ground in. Hand-labelled — citations are independently confirmed against gesetze-im-internet.de and our cases.json verification tests. The eval harness scores answers against this ground truth.",
"questions": [
{"id": "miete-01", "category": "Miete", "question": "Mein Vermieter kündigt mir wegen Eigenbedarf. Kann ich widersprechen und wenn ja wie?", "expected_paragraphs": ["§ 574 BGB"]},
{"id": "miete-02", "category": "Miete", "question": "Wie hoch darf mein Vermieter die Miete erhöhen?", "expected_paragraphs": ["§ 558 BGB"]},
{"id": "miete-03", "category": "Miete", "question": "Ich habe Schimmel in der Wohnung. Darf ich die Miete kürzen?", "expected_paragraphs": ["§ 536 BGB"]},
{"id": "miete-04", "category": "Miete", "question": "Wie lange ist die Kündigungsfrist wenn ich als Mieter kündige?", "expected_paragraphs": ["§ 573c BGB"]},
{"id": "miete-05", "category": "Miete", "question": "Mein Vermieter zahlt mir die Kaution nicht zurück. Welche Frist hat er?", "expected_paragraphs": ["§ 551 BGB"]},

{"id": "arbeit-01", "category": "Arbeit", "question": "Mein Chef will mir kündigen. Welche Rechte habe ich beim Kündigungsschutz?", "expected_paragraphs": ["§ 1 KSchG"]},
{"id": "arbeit-02", "category": "Arbeit", "question": "Wieviele Urlaubstage stehen mir pro Jahr mindestens zu?", "expected_paragraphs": ["§ 3 BUrlG"]},
{"id": "arbeit-03", "category": "Arbeit", "question": "Was darf mein Arbeitgeber zur maximalen täglichen Arbeitszeit verlangen?", "expected_paragraphs": ["§ 3 ArbZG"]},
{"id": "arbeit-04", "category": "Arbeit", "question": "Was sind die Aufgaben und Rechte des Betriebsrats?", "expected_paragraphs": ["§ 1 BetrVG"]},

{"id": "straf-01", "category": "Strafrecht", "question": "Mein Ex bedroht mich auf WhatsApp damit, mein Wohnort zu veröffentlichen. Was ist das strafrechtlich?", "expected_paragraphs": ["§ 241 StGB"]},
{"id": "straf-02", "category": "Strafrecht", "question": "Was ist Stalking strafrechtlich in Deutschland?", "expected_paragraphs": ["§ 238 StGB"]},
{"id": "straf-03", "category": "Strafrecht", "question": "Jemand hat mich öffentlich auf Instagram beleidigt. Welcher Paragraph greift?", "expected_paragraphs": ["§ 185 StGB"]},
{"id": "straf-04", "category": "Strafrecht", "question": "Was ist die rechtliche Definition von Betrug?", "expected_paragraphs": ["§ 263 StGB"]},
{"id": "straf-05", "category": "Strafrecht", "question": "Jemand greift mich körperlich an. Wann ist Notwehr erlaubt?", "expected_paragraphs": ["§ 32 StGB"]},
{"id": "straf-06", "category": "Strafrecht", "question": "Was ist Nötigung im deutschen Strafrecht?", "expected_paragraphs": ["§ 240 StGB"]},

{"id": "erbe-01", "category": "Erbrecht", "question": "Mein Vater ist gestorben ohne Testament. Wer erbt?", "expected_paragraphs": ["§ 1922 BGB"]},
{"id": "erbe-02", "category": "Erbrecht", "question": "Wie schreibe ich rechtsgültig ein handschriftliches Testament?", "expected_paragraphs": ["§ 2247 BGB"]},

{"id": "fam-01", "category": "Familie", "question": "Ab wann habe ich Anspruch auf Elternzeit?", "expected_paragraphs": ["§ 15 BEEG"]},

{"id": "gg-01", "category": "Grundgesetz","question": "Habe ich das Recht meine politische Meinung öffentlich zu äußern?", "expected_paragraphs": ["Art. 5 GG"]},
{"id": "gg-02", "category": "Grundgesetz","question": "Ist die Menschenwürde in Deutschland antastbar?", "expected_paragraphs": ["Art. 1 GG"]},
{"id": "gg-03", "category": "Grundgesetz","question": "Brauche ich für eine politische Demonstration eine Genehmigung?", "expected_paragraphs": ["Art. 8 GG"]},

{"id": "haft-01", "category": "Zivilrecht", "question": "Mein Nachbar hat mein Auto beschädigt. Auf welcher Grundlage kann ich Schadensersatz fordern?", "expected_paragraphs": ["§ 823 BGB"]},
{"id": "haft-02", "category": "Zivilrecht", "question": "Mein Kaufvertrag wurde nicht erfüllt. Welcher Paragraph regelt die Pflichten beim Kaufvertrag?", "expected_paragraphs": ["§ 433 BGB"]},

{"id": "data-01", "category": "Datenschutz","question": "Was sind die Grundbegriffe des deutschen Datenschutzes nach BDSG?", "expected_paragraphs": ["§ 4 BDSG"]}
]
}