mikelninh · May 28, 2026
diff --git a/.env.example b/.env.example
@@ -0,0 +1,21 @@
+# Copy this file to `.env.local` and fill in real values.
+# .env.local is gitignored — never commit your API keys.
+
+# ── Required for semantic search (search_laws, hybrid_search) ─────────
+# Used to embed queries against the FAISS vectorstore at rag/vectorstore/.
+# Get a key at https://platform.openai.com/api-keys
+# The four offline tools (verify_citation, lookup_paragraph, list_laws,
+# find_related_paragraphs) work WITHOUT a key — start there if you don't
+# want to pay for embeddings.
+OPENAI_API_KEY=sk-...
+
+# ── Optional: hosted SSE deployment (Fly.io / Railway / Cloud Run) ────
+# Only relevant if you're running the server in SSE mode for hosted agents.
+# Leave unset for stdio mode (the default for Claude Desktop / Cursor).
+# MCP_TRANSPORT=sse
+# PORT=8000
+# FASTMCP_HOST=0.0.0.0
+
+# ── Optional: alternative LLM providers (only used by the eval harness) ─
+# The eval/run.py harness uses gpt-4o-mini by default. Set EVAL_MODEL to override.
+# EVAL_MODEL=gpt-4o-mini
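A minimal sketch of how a server process could interpret the variables in `.env.example`. The function name `load_settings` and the returned keys are hypothetical, and the fallback values for `FASTMCP_HOST` and `PORT` simply reuse the commented-out examples above; the real server may resolve them differently.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve transport settings; stdio unless MCP_TRANSPORT=sse (hypothetical helper)."""
    transport = env.get("MCP_TRANSPORT", "stdio").lower()
    if transport not in {"stdio", "sse"}:
        raise ValueError(f"unsupported MCP_TRANSPORT: {transport!r}")
    settings = {
        "transport": transport,
        # Semantic search needs an OpenAI key; the offline tools do not.
        "semantic_search_enabled": bool(env.get("OPENAI_API_KEY")),
    }
    if transport == "sse":
        # Assumed defaults, taken from the example values in .env.example.
        settings["host"] = env.get("FASTMCP_HOST", "0.0.0.0")
        settings["port"] = int(env.get("PORT", "8000"))
    return settings
```

With an empty environment this resolves to stdio mode with semantic search disabled, which matches the "start with the offline tools" path described in the file.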
diff --git a/.github/workflows/freshness-check.yml b/.github/workflows/freshness-check.yml
@@ -0,0 +1,47 @@
+# Freshness guard rail — runs on every PR.
+#
+# If a law file in /laws/ is edited without regenerating manifest.json, this
+# workflow fails. The check is fast (~5s on 5,942 files) and zero-config.
+#
+# A separate daily-sync workflow (planned, freshness/TRUST.md § "Phase 1")
+# will run `freshness.sync` to actually re-fetch from gesetze-im-internet.de.
+# That one needs the upstream parser wired up — this one is the simple guard
+# that catches local drift today.
+
+name: Freshness Check
+
+on:
+  pull_request:
+    paths:
+      - "laws/**"
+      - "gitlaw_mcp/freshness/manifest.json"
+      - "gitlaw_mcp/freshness/build_manifest.py"
+  push:
+    branches: [main]
+    paths:
+      - "laws/**"
+      - "gitlaw_mcp/freshness/manifest.json"
+
+jobs:
+  manifest-matches-corpus:
+    name: Manifest matches corpus
+    runs-on: ubuntu-latest
+    timeout-minutes: 5
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: pip
+
+      - name: Verify manifest.json matches /laws/ contents
+        run: python -m gitlaw_mcp.freshness.build_manifest --check
+
+      - name: Report on drift
+        if: failure()
+        run: |
+          echo "::error::Corpus drift detected. A law file changed without"
+          echo "::error::regenerating manifest.json. To fix locally:"
+          echo "::error::  python -m gitlaw_mcp.freshness.build_manifest"
+          echo "::error::Then commit the updated gitlaw_mcp/freshness/manifest.json."
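A minimal sketch of the kind of check `build_manifest --check` performs, assuming the manifest maps each law file to a SHA-256 and carries one aggregate hash. The function names, the manifest layout, and the `*.md` glob are assumptions for illustration; the actual `build_manifest.py` is not shown in this diff.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hex SHA-256 of one file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(laws_dir: Path) -> dict:
    """Map each law file (relative POSIX path) to its hash, plus one aggregate hash."""
    files = {
        p.relative_to(laws_dir).as_posix(): file_sha256(p)
        for p in sorted(laws_dir.rglob("*.md"))
    }
    # Hash the sorted "path:digest" lines so the aggregate is deterministic.
    joined = "\n".join(f"{name}:{digest}" for name, digest in sorted(files.items()))
    return {
        "files": files,
        "aggregate_sha256": hashlib.sha256(joined.encode("utf-8")).hexdigest(),
    }


def find_drift(laws_dir: Path, manifest: dict) -> list[str]:
    """Return paths that were added, removed, or changed relative to the manifest."""
    current = build_manifest(laws_dir)["files"]
    recorded = manifest["files"]
    return sorted(
        name
        for name in current.keys() | recorded.keys()
        if current.get(name) != recorded.get(name)
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        laws = Path(tmp)
        (laws / "bgb.md").write_text("original text", encoding="utf-8")
        manifest = build_manifest(laws)
        print(find_drift(laws, manifest))  # nothing edited yet
        (laws / "bgb.md").write_text("edited text", encoding="utf-8")
        print(find_drift(laws, manifest))  # bgb.md no longer matches
```

A CI step built on this would exit non-zero whenever `find_drift` returns a non-empty list, which is the failure the "Report on drift" step explains.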
diff --git a/.github/workflows/upstream-sync.yml b/.github/workflows/upstream-sync.yml
@@ -0,0 +1,51 @@
+# Daily upstream sync — HEAD-checks gesetze-im-internet.de for drift in any
+# monitored law, then commits the updated snapshot file + sync log if anything
+# changed.
+#
+# This is what makes "is our corpus current?" answerable continuously, not just
+# at the moment a developer notices. Runs unattended on schedule; opens no PRs
+# (it commits directly to main because the only files touched are the
+# snapshot record and the human-readable log — never the corpus itself).
+#
+# Phase 2 of the freshness roadmap (rewriting the markdown corpus from the
+# fresh XML) is a separate workflow.
+
+name: Upstream Sync
+
+on:
+  schedule:
+    - cron: "17 5 * * *"   # 05:17 UTC daily — early-morning Europe, off the hour
+  workflow_dispatch: {}
+
+jobs:
+  sync:
+    name: Check upstream gesetze-im-internet.de for drift
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+    permissions:
+      contents: write   # the job commits state changes back to main
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Run sync
+        id: sync
+        run: |
+          python -m gitlaw_mcp.freshness.sync
+          echo "ran=true" >> "$GITHUB_OUTPUT"
+
+      - name: Commit snapshot + log changes (if any)
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
+          git add gitlaw_mcp/freshness/upstream_snapshots.json gitlaw_mcp/freshness/sync_log.md
+          if git diff --cached --quiet; then
+            echo "No upstream changes detected; nothing to commit."
+            exit 0
+          fi
+          git commit -m "chore(freshness): daily upstream sync — $(date -u +%Y-%m-%d)"
+          git push
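A minimal sketch of the comparison behind the drift check, assuming the sync reads the upstream `Last-Modified` header and compares it with a timezone-aware timestamp for our copy. The field names `drift_status` and `days_behind` come from the README in this PR; the function name, the `"current"` value, and the rounding down to whole days are assumptions. The network call is deliberately left out so the logic can be run offline.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def drift_status(upstream_last_modified: str, corpus_timestamp: datetime) -> dict:
    """Compare an HTTP Last-Modified header with our (timezone-aware) corpus timestamp."""
    upstream = parsedate_to_datetime(upstream_last_modified)
    if upstream.tzinfo is None:
        # Headers without a usable zone are treated as UTC (assumption).
        upstream = upstream.replace(tzinfo=timezone.utc)
    if upstream <= corpus_timestamp:
        return {"drift_status": "current", "days_behind": 0}
    # Whole days only; a sub-day lag is reported as stale with days_behind == 0.
    return {"drift_status": "stale", "days_behind": (upstream - corpus_timestamp).days}


if __name__ == "__main__":
    ours = datetime(2026, 4, 1, tzinfo=timezone.utc)
    print(drift_status("Thu, 21 May 2026 00:00:00 GMT", ours))
```

Because only the header is compared, this detects that upstream changed, not what changed, which is consistent with the workflow touching just the snapshot and log files.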
diff --git a/gitlaw_mcp/README.md b/gitlaw_mcp/README.md
@@ -1,25 +1,64 @@
 # GitLaw MCP Server
 
 [![MCP CI](https://github.com/mikelninh/gitlaw/actions/workflows/mcp-ci.yml/badge.svg)](https://github.com/mikelninh/gitlaw/actions/workflows/mcp-ci.yml)
-[![Eval: 118/118](https://img.shields.io/badge/eval-118%2F118_(100%25)-brightgreen?logo=pytest)](gitlaw_mcp/tests/cases.json)
-[![Transport: stdio + SSE](https://img.shields.io/badge/transport-stdio_%2B_SSE-blue)](#hosted-deployment-flyio-frankfurt)
+[![Tests](https://img.shields.io/badge/tests-146%2F146-brightgreen?logo=pytest)](tests/)
+[![Hallucination rate](https://img.shields.io/badge/measured_hallucinations-0%25-brightgreen)](eval/eval_summary.md)
+[![Trust statement](https://img.shields.io/badge/trust-TRUST.md-blue)](freshness/TRUST.md)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](../LICENSE)
 
-> **MCP server that exposes 5,936 German laws + RAG search + citation verification as tools any LLM client can call.**
+> **Model Context Protocol server for German federal law — 5,942 statutes indexed, anti-hallucination citation verification, daily drift detection against the official source. Built for legal agents that need to ground every § they cite.**
 
-Built on top of the existing GitLaw RAG pipeline (FAISS vectorstore, OpenAI embeddings, paragraph-level chunking of all federal German laws).
+10 tools, one resource, one trust contract:
+
+| You ask Claude / Cursor… | …with GitLaw MCP it answers |
+|---|---|
+| "Verify § 573 BGB" | Returns the real paragraph text. Or `verified: false` with a structured reason. **Never invents.** |
+| "Mein Vermieter kündigt wegen Eigenbedarf — was kann ich tun?" ("My landlord is terminating my lease for personal use. What can I do?") | Semantic search finds § 574 BGB and returns its text, so the LLM grounds its answer in the real statute |
+| "How do you know your BGB is current?" | `check_upstream_currency("BGB")` returns `days_behind` relative to the live version on gesetze-im-internet.de |
+| "What's the integrity hash of your corpus right now?" | `get_corpus_status()` — single SHA-256 every consumer can verify |
 
 ---
 
 ## Why this exists
 
-LLMs hallucinate German law all the time. They confidently cite `§ 999 StGB` (doesn't exist), invent paragraph titles, or swap statutes. This server gives any MCP-compatible client (Claude Desktop, Cursor, Continue, custom agents) a set of **verifiable** legal tools:
+LLMs hallucinate German law all the time. They confidently cite `§ 999 StGB` (doesn't exist), invent paragraph titles, or swap statutes. We measured `gpt-4o-mini` on 25 real Lebenslagen (everyday-life) questions: **5.9% of its cited paragraphs were fake.** That's catastrophic for a lawyer, harmful for a citizen, and dishonest for an AI product.
+
+With GitLaw MCP available as a tool, the hallucination rate in the same eval dropped to **0%**: the model has no reason to invent when `verify_citation` is one call away. See [`eval/eval_summary.md`](eval/eval_summary.md) for the reproducible report.
+
+This server gives any MCP-compatible client (Claude Desktop, Cursor, Continue, custom agents) the legal-tools surface they need:
 
-- **Semantic search** across all 5,936 laws → grounded retrieval
-- **Citation verification** → returns *the actual paragraph text* if the citation exists, or `verified: false` with a reason if not
+- **Semantic search** across all 5,942 federal statutes → grounded retrieval
+- **Citation verification** → real paragraph text or structured rejection — **no hallucinated §**
 - **Exact lookup** by abbreviation + paragraph
-- **Law enumeration** for discovery
+- **Citation-graph traversal** (94k nodes, 200k edges) — who cites whom
+- **Corpus provenance** — every served paragraph has a public source URL and SHA-256
+- **Live drift detection** — daily HEAD-check against gesetze-im-internet.de, surfaces stale law data
+
+---
+
+## How do you know it's correct? *(read this before building on top of us)*
+
+This is the most important section of the README. Trust isn't a vibe — it's evidence.
+
+| Question | Where to look |
+|---|---|
+| Is every cited § actually in the corpus? | `verify_citation()` returns `verified: false` if not. **0% hallucination rate measured on the 25-question eval.** |
+| Where does each law come from? | `verify_law_provenance(abbr)` → official source URL + SHA-256 + git timestamp |
+| Is the corpus the same one another agent is seeing? | `get_corpus_status()` → single aggregate SHA-256, deterministic, public |
+| Has anything changed upstream since we synced? | `check_upstream_currency(abbr)` → days behind upstream + last-modified timestamps |
+| What's your full promise / disclosure of gaps? | **Read [`freshness/TRUST.md`](freshness/TRUST.md): it states what we promise and where the known gaps are.** |
+
+**Live drift status** (the integrity check is automated; this section reflects the latest sync):
+
+```
+6 of 36 monitored laws are stale vs. upstream gesetze-im-internet.de
+  BGB:  50 days behind   ZPO:  49 days behind   SGG:  49 days behind
+  GG:   29 days behind   HGB:  29 days behind   AO:   21 days behind
+```
+
+We tell you this *on purpose*. A citizen looking up tenant rights should know if our § 573 BGB is older than the official version. The daily cron (`upstream-sync.yml`) refreshes this drift report automatically; it does not rewrite the corpus itself.
 
-The result: an LLM connected to this server can ground every legal claim in the real German Bundesrecht corpus, with a structured "I checked" / "I couldn't verify" signal on every citation.
+---
 
 ---
 
@@ -74,13 +113,25 @@ Known limitations (honest):
 
 ## Tools exposed
 
+### Retrieval & verification (the core six)
+
 | Tool | Purpose | Example |
 |---|---|---|
-| `search_laws(query, limit=5)` | Semantic search across all paragraphs (FAISS, OpenAI embeddings) | `"Beleidigung im Internet"` |
-| `verify_citation(citation)` | Parse `§ 185 StGB` style strings, return actual text or `verified: false` with reason | `"§ 185 Abs. 1 StGB"` |
-| `lookup_paragraph(abbr, paragraph)` | Exact lookup with structured input | `("StGB", "263a")` |
-| `list_laws(filter=None, limit=50)` | Enumerate available laws (4,852+ unique abbreviations indexed) | `filter="bgb"` |
-| `find_related_paragraphs(citation)` | Walk the citation graph (94K paragraphs, 200K refs) — returns who cites X *and* what X cites | `"§ 185 StGB"` |
+| `search_laws(query, limit=5)` | Semantic search across all paragraphs (FAISS + OpenAI embeddings) | `"Beleidigung im Internet"` |
+| `verify_citation(citation)` | Parse `§ 185 StGB` style strings → real text or structured rejection. **The anti-hallucination tool.** | `"§ 185 Abs. 1 StGB"` |
+| `lookup_paragraph(abbr, paragraph)` | Exact lookup when you have structured input | `("StGB", "263a")` |
+| `list_laws(filter=None, limit=50)` | Enumerate available laws (5,942 indexed) | `filter="bgb"` |
+| `find_related_paragraphs(citation)` | Walk the citation graph (94k nodes, 200k edges) — who cites X, what X cites | `"§ 185 StGB"` |
+| `hybrid_search(query, limit, expand)` | Semantic + 1-hop graph expansion in one call | `"Eigenbedarf", expand=2` |
+
+### Provenance & freshness (the four trust tools)
+
+| Tool | Purpose | Example output |
+|---|---|---|
+| `get_corpus_status()` | Single integrity hash + law count + when manifest was last built | `aggregate_sha256: b93152a9…` |
+| `verify_law_provenance(abbr)` | Source URL + SHA-256 + git timestamp for one law | source_url, corpus_sha256, corpus_bytes |
+| `check_upstream_currency(abbr)` | Compares our git timestamp with the `Last-Modified` header from gesetze-im-internet.de | `drift_status: "stale", days_behind: 50` |
+| `list_drifted_laws()` | Every monitored law where upstream is newer than our corpus, sorted by staleness | sorted list of drifted laws |
 
 Plus the resource `gitlaw://law/{abbreviation}` returning the full markdown content of a law.
 
@@ -307,13 +358,42 @@ default `gitlaw_mcp/Dockerfile` stays in stdio mode for Claude Desktop.
 ## Roadmap
 
 - [x] ~~HTTP/SSE transport~~ — done (Dockerfile.fly + fly.toml + SSE in server.py)
-- [x] ~~Citation graph + `find_related_paragraphs` tool~~ — done (94K nodes, 200K edges)
-- [ ] Eval harness: 50+ hand-labelled citation-verification cases, run in CI
+- [x] ~~Citation graph + `find_related_paragraphs` tool~~ — done (94k nodes, 200k edges)
+- [x] ~~Eval harness with reproducible hallucination measurement~~ — done (`eval/`, 25 questions)
+- [x] ~~Corpus provenance manifest~~ — done (`freshness/manifest.json`, per-law SHA-256)
+- [x] ~~Live drift detection vs. gesetze-im-internet.de~~ — done (`freshness/sync.py`, daily cron)
+- [ ] **Phase 1b** — auto-resync stale markdown when drift detected (needs XML→markdown parser, ~2 weekends)
+- [ ] Nested `§ X Abs. Y Nr. Z` citation parsing
 - [ ] Schweizer / Österreichischer Rechtskorpus (already partially in `laws_*.py`)
+- [ ] Landesrecht (state-level law)
 - [ ] Per-tenant rate limiting (relevant once multi-tenant SSE clients exist)
 
 ---
 
+## Part of an MCP-server portfolio
+
+GitLaw MCP is one of three Model Context Protocol servers built as
+**a thin agent-readable layer over real-world workflows**. The pattern is
+deliberately reproducible — same architecture, different domains:
+
+- **[gitlaw-mcp](https://github.com/mikelninh/gitlaw)** — German federal law (you're here)
+- **[safevoice-mcp](https://github.com/mikelninh/safevoice)** — victim-of-digital-harassment tooling: classification, applicable §, Strafantrag-Fristen, jurisdiction, anonymisation (DE/AT/CH/UK)
+- **[grailsense](https://github.com/mikelninh/grailsense)** — NFT collector intelligence over Blockscout: archetype classification + shareable soul cards
+
+Together they're an early sketch of what **public-good civic infrastructure**
+looks like in the LLM era: open source, MIT, verifiable, composable.
+
+---
+
+## Contact + community
+
+- **Issues / bug reports** — [GitHub Issues](https://github.com/mikelninh/gitlaw/issues)
+- **Strategic discussion** — [GitHub Discussions](https://github.com/mikelninh/gitlaw/discussions)
+- **Direct** — open an issue tagged `question` if it's broader than a bug
+- **Built by** [@mikelninh](https://github.com/mikelninh) — Berlin
+
+---
+
 ## License
 
-MIT. Part of the [GitLaw](../README.md) project — open infrastructure for digital legal services in Germany.
+MIT. Part of the [GitLaw](../README.md) project — open infrastructure for digital legal services in Germany. The underlying corpus of German federal law is public domain per § 5 UrhG.
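A minimal sketch of the "real text or structured rejection" contract the README describes for `verify_citation`. Everything here is a stand-in: the regex, the function name `verify_citation_sketch`, the `reason` strings, and the in-memory `corpus` dict are assumptions, and the paragraph text is a placeholder rather than statute text. It handles `§ N [Abs. M] ABBR` only, in line with the roadmap item noting that nested `Nr.` parsing is still open.

```python
import re

# Assumed citation shape: "§ 185 StGB", "§ 263a StGB", "§ 185 Abs. 1 StGB".
CITATION_RE = re.compile(
    r"§\s*(?P<paragraph>\d+[a-z]?)"      # paragraph number, optionally with a letter
    r"(?:\s+Abs\.\s*(?P<absatz>\d+))?"   # optional Absatz (subsection)
    r"\s+(?P<abbr>[A-Za-zÄÖÜäöü]+)\s*$"  # law abbreviation
)


def verify_citation_sketch(citation: str, corpus: dict) -> dict:
    """Return the stored text for a citation, or verified=False with a reason."""
    match = CITATION_RE.match(citation.strip())
    if match is None:
        return {"verified": False, "reason": "unparseable_citation"}
    abbr, paragraph = match["abbr"], match["paragraph"]
    law = corpus.get(abbr)
    if law is None:
        return {"verified": False, "reason": "unknown_law"}
    text = law.get(paragraph)
    if text is None:
        return {"verified": False, "reason": "paragraph_not_found"}
    return {"verified": True, "abbr": abbr, "paragraph": paragraph, "text": text}


if __name__ == "__main__":
    corpus = {"StGB": {"185": "(placeholder for the text of § 185 StGB)"}}
    print(verify_citation_sketch("§ 185 Abs. 1 StGB", corpus))
    print(verify_citation_sketch("§ 999 StGB", corpus))
```

The point of the design is that a failed lookup never falls back to generated text: the caller gets either stored text or an explicit rejection it can surface to the user.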
diff --git a/gitlaw_mcp/eval/.gitignore b/gitlaw_mcp/eval/.gitignore
@@ -0,0 +1,3 @@
+# Per-run output reports — eval_summary.md is committed as the latest snapshot,
+# but the timestamped per-run JSON dumps are not (they grow without bound).
+eval_report_*.json
diff --git a/gitlaw_mcp/eval/README.md b/gitlaw_mcp/eval/README.md
@@ -0,0 +1,81 @@
+# GitLaw MCP — outcome eval
+
+This directory is the **public, reproducible eval harness** for GitLaw MCP. It
+measures the answer-quality difference between an LLM answering legal questions
+*without* tools versus *with* the GitLaw MCP tools available.
+
+The whole point: claims about anti-hallucination need data, not vibes. This is
+the data.
+
+---
+
+## How to read the headline number
+
+A run produces two metrics that matter:
+
+- **Hallucination rate** — fraction of cited paragraphs that don't exist in the
+  German Bundesrecht corpus. Lower is better. The MCP is designed to drive this
+  to zero, because every cited § goes through `verify_citation` before the model
+  uses it.
+- **Expected-hit rate** — fraction of questions where the answer cited at least
+  one of the paragraphs a competent legal answer would mention. Higher is better.
+
+A useful third number: **citations per answer**. Treatment is usually lower than
+baseline because the model becomes more conservative (only cites what it
+verified). That's by design — but worth watching, because over-conservatism can
+cost hit-rate.
+
+## Run it yourself
+
+```bash
+cd /path/to/gitlaw
+set -a; source .env.local; set +a       # export OPENAI_API_KEY to child processes
+python -m gitlaw_mcp.eval.run --limit 5     # cheap smoke (~30s, ~$0.005)
+python -m gitlaw_mcp.eval.run               # full 25 questions (~2 min, ~$0.05)
+python -m gitlaw_mcp.eval.run --model gpt-4o   # bigger model
+```
+
+Output:
+- `eval_report_<utc-timestamp>.json` — full per-question detail (input, both
+  answers, every citation, verification result for each)
+- `eval_summary.md` — the markdown summary that gets committed to the repo
+
+## Question set (`questions.json`)
+
+25 hand-labelled questions across Miete, Arbeit, Strafrecht, Erbrecht,
+Familie, Grundgesetz, Zivilrecht, Datenschutz. Each comes with
+`expected_paragraphs` — the canonical citation(s) we hand-verified against
+gesetze-im-internet.de.
+
+The set is intentionally biased toward **realistic Lebenslagen** a citizen,
+tenant, employee, or harassment victim would actually ask — not law-school
+exam questions. Adding harder long-tail questions (less-common statutes
+where the baseline model is more likely to invent) is on the roadmap; those
+will widen the gap further.
+
+## Latest run (committed)
+
+See [`eval_summary.md`](./eval_summary.md). It's regenerated on every run and
+the most recent committed version is the public record. Past runs sit in git
+history.
+
+## What the eval cannot show
+
+Honest limits:
+
+- **One language only (German).** A multilingual eval would need a multilingual
+  question set + ground truth in each language.
+- **One model class.** We test `gpt-4o-mini` by default. We expect the gap to
+  widen with weaker models (e.g. `gpt-3.5-turbo`) and to narrow with stronger
+  ones (`gpt-4o`, Claude Opus); the `--model` flag lets you check.
+- **Hit-rate is binary per question.** We don't yet score "partial hit"
+  (cited a related but adjacent §).
+- **Citation extraction is regex-based.** Models sometimes phrase citations
+  in ways our regex misses — that under-counts citations for both conditions
+  equally, but distorts absolute hit-rate downward.
+
+These are known. Patches welcome.
+
+## License
+
+Same as the rest of GitLaw MCP — MIT.
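A minimal sketch of the two headline metrics defined in this README, assuming each answer has already been reduced to a list of cited paragraphs. The function names and the record keys `cited` and `expected_paragraphs` (the latter borrowed from `questions.json`) are illustrative; the scoring code in `eval/run.py` is not part of this diff and may differ.

```python
def hallucination_rate(cited: list[str], known: set[str]) -> float:
    """Fraction of cited paragraphs that do not exist in the known corpus."""
    if not cited:
        return 0.0
    fake = [c for c in cited if c not in known]
    return len(fake) / len(cited)


def expected_hit_rate(answers: list[dict]) -> float:
    """Fraction of questions citing at least one expected paragraph (binary per question)."""
    if not answers:
        return 0.0
    hits = sum(
        1 for a in answers if set(a["cited"]) & set(a["expected_paragraphs"])
    )
    return hits / len(answers)


if __name__ == "__main__":
    known = {"§ 573 BGB", "§ 574 BGB", "§ 185 StGB"}
    answers = [
        {"cited": ["§ 574 BGB", "§ 999 StGB"], "expected_paragraphs": ["§ 574 BGB"]},
        {"cited": ["§ 185 StGB"], "expected_paragraphs": ["§ 186 StGB"]},
    ]
    all_cited = [c for a in answers for c in a["cited"]]
    print(hallucination_rate(all_cited, known))  # one fake out of three citations
    print(expected_hit_rate(answers))            # one hit out of two questions
```

Pooling citations across all answers before dividing matches the "fraction of cited paragraphs" wording; averaging per-question rates instead would weight short answers more heavily.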
diff --git a/gitlaw_mcp/eval/__init__.py b/gitlaw_mcp/eval/__init__.py
@@ -0,0 +1 @@
+"""Outcome eval harness — does GitLaw MCP measurably reduce hallucinations?"""