feat(freshness): corpus manifest + provenance MCP tools + TRUST.md by mikelninh · Pull Request #3 · mikelninh/gitlaw

mikelninh · 2026-05-28T18:12:03Z

Answers the question that sits under every "verified" claim: how do you know the underlying corpus is correct, current, and the same corpus the next agent will see?

What's in this PR

| File | What it does |
| --- | --- |
| `freshness/build_manifest.py` | Walks `/laws/`, hashes every file, produces aggregate hash that changes iff any law changes |
| `freshness/manifest.json` | 5,942 laws with per-file SHA-256, byte size, git timestamp, source URL; committed as the public corpus snapshot |
| `freshness/TRUST.md` | Explicit promise document: what we guarantee today, what we don't yet, the gap-closing roadmap |
| `freshness/README.md` | How to use the scaffolding |
| `server.py` | Two new MCP tools: `get_corpus_status()` + `verify_law_provenance(abbr)` |
| `tests/test_provenance.py` | 7 new tests, includes CI canary that catches uncommitted corpus drift |
| `.github/workflows/freshness-check.yml` | Same canary as a GitHub Action |
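A minimal sketch of how a manifest with an aggregate hash could be built. The aggregation scheme in `build_manifest.py` is not spelled out in this description, so the scheme below (SHA-256 over sorted `path:digest` lines) is an assumption, and the names `file_sha256`, `build_manifest`, and the entry fields are illustrative rather than the project's API.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(laws_dir: Path) -> dict:
    """Hash every Markdown file under laws_dir and derive one aggregate hash."""
    entries = {}
    for path in laws_dir.rglob("*.md"):
        rel = path.relative_to(laws_dir).as_posix()
        entries[rel] = {"sha256": file_sha256(path), "bytes": path.stat().st_size}

    # Assumed scheme: hash the sorted "path:digest" lines so the result does
    # not depend on filesystem walk order and changes when any file changes.
    aggregate = hashlib.sha256()
    for rel in sorted(entries):
        aggregate.update(f"{rel}:{entries[rel]['sha256']}\n".encode("utf-8"))

    return {
        "law_count": len(entries),
        "aggregate_sha256": aggregate.hexdigest(),
        "laws": entries,
    }
```

Sorting the relative paths before aggregating is what would make two consumers on the same commit arrive at the same number under this scheme.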

Headline numbers

- 5,942 laws in the corpus, each with a public hash
- aggregate_sha256: b93152a9…b48fdb81, the one-number proof of corpus state
- Test count: 130 → 137 (all green)
- Hash check latency: ~4 minutes locally (the slow CI canary is by design)

What this unlocks for the user

Anyone — citizen, lawyer, or LLM agent — can now answer "where does this answer come from?" by calling two tools or opening one JSON file:

verify_law_provenance("BGB")
→ {
    "source_url": "https://www.gesetze-im-internet.de/bgb/",
    "corpus_path": "laws/bgb.md",
    "corpus_sha256": "5a6fc44acf93bf722d8721bf7948d484f370ce4754e29e35d57de82c4a09d2da",
    "corpus_bytes": 1623932,
    "git_last_modified_iso": "2026-04-07T21:07:06+07:00"
  }
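A consumer could use such a record to check a local checkout independently. This is a sketch only: the field names are taken from the sample response above, while `matches_provenance` and `repo_root` are hypothetical names introduced for illustration.

```python
import hashlib
from pathlib import Path


def matches_provenance(record: dict, repo_root: Path) -> bool:
    """Check a local law file against the size and hash in a provenance record."""
    data = (repo_root / record["corpus_path"]).read_bytes()
    # Cheap size comparison first, then the full SHA-256 comparison.
    if len(data) != record["corpus_bytes"]:
        return False
    return hashlib.sha256(data).hexdigest() == record["corpus_sha256"]
```

If this returns `False` for a clean checkout of the same commit, the local file and the published manifest entry disagree.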

Honest scope

This PR is the scaffold — data structures, MCP tools, tests, CI guard rail.

What's NOT in this PR (intentional, on the roadmap in TRUST.md):

- Live daily re-sync from gesetze-im-internet.de (needs upstream XML → markdown parser, multi-day work)
- Per-paragraph "in-force since" dates
- Landesrecht + EU corpus expansion
- Notarised weekly snapshots on Hugging Face Datasets

The scaffold is what makes those next milestones measurable.

🤖 Generated with Claude Code

…s. without MCP

The existing tests in gitlaw_mcp/tests/ prove correctness (the tools return the right answer when asked correctly). They don't prove *impact*: does giving an LLM these tools actually change how it answers a citizen's legal question?

This eval harness answers exactly that. 25 hand-labelled real Lebenslagen (life-situation) questions, run twice through gpt-4o-mini:

- BASELINE: no tools, answers from training-only knowledge
- TREATMENT: same prompt, GitLaw tools available via OpenAI function-calling (functionally equivalent to how an MCP client exposes them)

Headline result on the first committed run:

- hallucination rate: 5.9% → 0.0% (every cited § now verified against corpus)
- expected hit rate: 62.5% → 62.5% (no change; see below for honest read)
- mean tool calls per question (treatment): 1.25

The hallucination story is real and reproducible. The hit-rate stability is the honest part: gpt-4o-mini already knows the well-known statutes in our question set; the treatment becomes more conservative (cites 1.46 § vs 2.12 in baseline) because it only emits verified citations. The diagnostic info in the eval_summary.md per-question table shows exactly which questions need better prompting in treatment and which need harder long-tail entries to widen the gap.

Files:

- questions.json: 25 hand-labelled questions w/ expected_paragraphs
- run.py: eval harness with --model / --limit flags
- README.md: how to run, how to read, honest limits
- eval_summary.md: latest run committed as public record (regenerated each run)
- .gitignore: keeps timestamped per-run JSON dumps out of git history

Roadmap on the README: harder long-tail questions, multi-model comparison (gpt-3.5-turbo / gpt-4o-mini / gpt-4o), citation-extraction improvements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
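The two headline metrics can be illustrated with a small sketch. The exact definitions used by run.py are not given in this commit message, so both formulas below are assumptions: hallucination rate as the share of cited paragraphs that are absent from the corpus, and expected hit rate as the share of expected paragraphs that were cited. The function names and the string form of a citation are hypothetical.

```python
from __future__ import annotations


def hallucination_rate(cited: list[set[str]], corpus: set[str]) -> float:
    """Share of all cited paragraphs that do not exist in the corpus."""
    total = sum(len(per_question) for per_question in cited)
    if total == 0:
        return 0.0
    unverified = sum(len(per_question - corpus) for per_question in cited)
    return unverified / total


def expected_hit_rate(cited: list[set[str]], expected: list[set[str]]) -> float:
    """Share of expected paragraphs that the answers actually cited."""
    wanted = sum(len(per_question) for per_question in expected)
    if wanted == 0:
        return 0.0
    found = sum(len(c & e) for c, e in zip(cited, expected))
    return found / wanted
```

Under these definitions a run can lower its hallucination rate without moving its hit rate, for example by dropping unverifiable citations while still missing the same expected paragraphs.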

…ment

Answers the question that sits under every claim of "this answer is verified": how do you know the underlying corpus is correct, current, and the same corpus the next agent will see?

What's new:

- gitlaw_mcp/freshness/build_manifest.py
  Walks /laws/, computes SHA-256 + byte size + git-last-modified per file, aggregates them into a single hash that changes iff any law in the corpus changes. Idempotent. --check flag exits non-zero on drift.
- gitlaw_mcp/freshness/manifest.json
  Committed snapshot of all 5,942 laws with their hashes, source URLs, git timestamps. The aggregate_sha256 is the one-number proof of corpus state: two consumers on the same commit see the same number.
- gitlaw_mcp/freshness/TRUST.md
  The explicit promise: what we guarantee today (public source URL per law, single integrity hash, git audit log, 0% hallucination on every citation via verify_citation), what we don't yet guarantee (no daily sync, no per-paragraph in-force dates, federal-only), and the roadmap. Reads like a guarantee document, not marketing.
- gitlaw_mcp/server.py
  Two new MCP tools: get_corpus_status() and verify_law_provenance(abbr). Either tool answers "where does this answer come from" and is callable by any MCP client. Returns source URL + corpus hash + git timestamp + file path.
- gitlaw_mcp/tests/test_provenance.py
  7 new tests pinning the tool contracts. Includes a CI canary that runs `build_manifest --check`; if a law file changes without the manifest being regenerated, the test goes red. (Slow because it hashes 5,942 files, but worth it as a guard rail.)
- .github/workflows/freshness-check.yml
  Same canary as a GitHub Action; runs on PRs that touch /laws/ or the manifest. Drift becomes impossible to merge without noticing.

Honest scope: This PR delivers the *scaffold*: the data structures, MCP tools, tests, and CI guard rail. The actual daily re-sync from gesetze-im-internet.de (Phase 1 in TRUST.md) is the next milestone and is multi-day work because the upstream XML → markdown parser needs to be wired against our existing normaliser. The scaffold is what makes that next step measurable.

Tests: 130 → 137 (all green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
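The drift check described for the --check flag can be sketched as a comparison between the committed manifest and a freshly built one. This is an illustration under assumptions: the manifest layout (a `laws` mapping from path to an entry with a `sha256` field) and the names `manifest_drift` and `check_exit_code` are hypothetical and not taken from build_manifest.py.

```python
def manifest_drift(committed: dict, current: dict) -> dict:
    """List paths that were added, removed, or changed between two manifests."""
    old, new = committed["laws"], current["laws"]
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            path
            for path in old.keys() & new.keys()
            if old[path]["sha256"] != new[path]["sha256"]
        ),
    }


def check_exit_code(committed: dict, current: dict) -> int:
    """Return 1 if any drift exists, else 0, mirroring a non-zero exit on drift."""
    drift = manifest_drift(committed, current)
    return 1 if any(drift.values()) else 0
```

A CI job could pass this value to `sys.exit` so that a law file edited without regenerating the manifest fails the build.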

vercel · 2026-05-28T18:12:06Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| gitlaw | Ready | Preview, Comment | May 28, 2026 7:56pm |

Two ruff lint cleanups, no behaviour change:

- eval/run.py: noqa E402 on imports that must follow sys.path setup
- freshness/build_manifest.py: drop unused `os` import

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nternet.de

Builds on the manifest scaffold. Adds the daily HEAD-check that makes upstream drift *visible*, without yet rewriting our markdown corpus when drift is detected (Phase 1b).

What this adds:

- gitlaw_mcp/freshness/upstream_sources.json
  Registry of 36 monitored laws, each mapping our abbreviation to the upstream gesetze-im-internet.de slug (which often differs, e.g. AufenthG → aufenthg_2004 because German law has year-versioned URLs).
- gitlaw_mcp/freshness/sync.py
  Reads the registry, HEAD-requests every entry, compares Last-Modified + ETag to the committed upstream_snapshots.json. On drift: updates the snapshot and appends a timestamped row to sync_log.md. Modes:
  - --dry-run (don't write)
  - --offline (skip network, summarise cache)
- gitlaw_mcp/freshness/upstream_snapshots.json
  Committed record of what we last saw upstream. Per-law ETag, Last-Modified, first_seen and last_checked timestamps.
- gitlaw_mcp/server.py
  Two new MCP tools:
  - check_upstream_currency(abbreviation): compares our corpus git-timestamp against upstream Last-Modified, returns drift_status + days_behind
  - list_drifted_laws(): every monitored law where upstream is newer than our corpus, sorted by staleness descending
- .github/workflows/upstream-sync.yml
  Daily cron at 05:17 UTC. Runs sync, commits snapshot + log if anything changed. Touches only freshness/ files, never the corpus itself.
- gitlaw_mcp/tests/test_upstream_sync.py
  9 hermetic tests (no network) covering: first-sync baseline, drift detection, network-failure preservation of prior snapshots, dry-run, offline mode, both new MCP tools, drift-list sort order.
- gitlaw_mcp/freshness/TRUST.md
  Updated to reflect Phase 1a as shipped, Phase 1b as the next milestone (auto-resync of stale markdown; needs XML → markdown parser).

First live run discovered 6 of 36 laws were already stale upstream, with BGB being 50 days behind. That's exactly the kind of fact a citizen or lawyer should know before relying on our markdown. Now they can.

Test count: 137 → 146 (all green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
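The comparison logic behind the drift check can be sketched without any network access. This is a hedged illustration, not the code in sync.py: the snapshot keys `etag` and `last_modified` and the function names `headers_changed` and `days_behind` are assumptions. It uses the standard HTTP date format for Last-Modified and an ISO 8601 git timestamp like the one in the provenance sample.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def headers_changed(snapshot: dict, headers: dict) -> bool:
    """True if the ETag or Last-Modified differs from the last stored snapshot."""
    return (
        snapshot.get("etag") != headers.get("ETag")
        or snapshot.get("last_modified") != headers.get("Last-Modified")
    )


def days_behind(corpus_iso: str, upstream_last_modified: str) -> int:
    """Whole days by which upstream is newer than the corpus file (0 if not newer)."""
    corpus_time = datetime.fromisoformat(corpus_iso)  # e.g. git timestamp with offset
    upstream_time = parsedate_to_datetime(upstream_last_modified)  # HTTP date in GMT
    return max((upstream_time - corpus_time).days, 0)
```

Keeping these as pure functions over already-fetched headers is one way to make such logic testable hermetically, in the spirit of the no-network tests listed above.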

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Polish for the public publish. Restructures the README so a casual visitor sees the killer features above the fold:

- Headline + badges updated (146 tests, 0% measured hallucinations, TRUST.md link)
- "What you ask Claude" use-case table at the top; concrete, not abstract
- "Why this exists" leads with the 5.9% → 0% eval result instead of generic claims
- NEW "How do you know it's correct?" section: five questions, five tools, each answer is a one-call demonstration. Plus the embedded live drift status block (6 of 36 monitored laws stale upstream, BGB 50 days behind)
- Tools table split into "core six" + "trust four" so the new provenance and freshness tools are surfaced as features, not buried
- Cross-link block to safevoice-mcp and grailsense; declares the MCP-server portfolio strategy explicitly
- Roadmap updated: ✅ eval / ✅ manifest / ✅ drift detection / Phase 1b next
- Contact + community section so visitors know where to ask questions
- New .env.example at repo root; copy to .env.local, fill in OPENAI_API_KEY

The single most important add is the "How do you know it's correct?" section. That's the differentiator against every other legal-tech tool, AI or otherwise: we don't ask you to trust; we hand you the four tools to verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hallochupi-sketch and others added 2 commits May 28, 2026 18:57

vercel Bot deployed to Preview May 28, 2026 18:13 View deployment

hallochupi-sketch and others added 2 commits May 28, 2026 20:17

mikelninh mentioned this pull request May 28, 2026

feat(eval): outcome-eval harness — measures impact, not just correctness #2

Closed

vercel Bot deployed to Preview May 28, 2026 18:24 View deployment

fix(types): narrow in_degree before int() — mypy call-overload

810fbc3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 28, 2026 19:42 View deployment

vercel Bot deployed to Preview May 28, 2026 19:56 View deployment

mikelninh merged commit daef586 into main May 28, 2026
6 checks passed

mikelninh deleted the feat/freshness-provenance branch May 28, 2026 20:11

mikelninh merged 6 commits into main from feat/freshness-provenance
