feat(freshness): corpus manifest + provenance MCP tools + TRUST.md#3
Merged
Conversation
…s. without MCP
The existing tests in gitlaw_mcp/tests/ prove correctness (the tools return the
right answer when asked correctly). They don't prove *impact* — does giving an
LLM these tools actually change how it answers a citizen's legal question?
This eval harness answers exactly that. 25 hand-labelled real Lebenslagen
questions, run twice through gpt-4o-mini:
BASELINE — no tools, answers from training-only knowledge
TREATMENT — same prompt, GitLaw tools available via OpenAI function-calling
(functionally equivalent to how an MCP client exposes them)
Headline result on the first committed run:
hallucination rate: 5.9% → 0.0% (every cited § now verified against corpus)
expected hit rate: 62.5% → 62.5% (no change — see below for honest read)
mean tool calls per question (treatment): 1.25
The hallucination story is real and reproducible. The hit-rate stability is
the honest part: gpt-4o-mini already knows the well-known statutes in our
question set; the treatment becomes more conservative (cites 1.46 § vs 2.12
in baseline) because it only emits verified citations. The diagnostic info
in eval_summary.md per-question table shows exactly which questions need
better prompting in treatment and which need harder long-tail entries to
widen the gap.
Files:
questions.json — 25 hand-labelled questions w/ expected_paragraphs
run.py — eval harness with --model / --limit flags
README.md — how to run, how to read, honest limits
eval_summary.md — latest run committed as public record (regenerated each run)
.gitignore — keeps timestamped per-run JSON dumps out of git history
Roadmap on the README: harder long-tail questions, multi-model comparison
(gpt-3.5-turbo / gpt-4o-mini / gpt-4o), citation-extraction improvements.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ment
Answers the question that sits under every claim of "this answer is verified":
how do you know the underlying corpus is correct, current, and the same
corpus the next agent will see?
What's new:
- gitlaw_mcp/freshness/build_manifest.py
Walks /laws/, computes SHA-256 + byte size + git-last-modified per file,
aggregates them into a single hash that changes iff any law in the
corpus changes. Idempotent. --check flag exits non-zero on drift.
- gitlaw_mcp/freshness/manifest.json
Committed snapshot of all 5,942 laws with their hashes, source URLs,
git timestamps. The aggregate_sha256 is the one-number proof of corpus
state — two consumers on the same commit see the same number.
- gitlaw_mcp/freshness/TRUST.md
The explicit promise: what we guarantee today (public source URL per
law, single integrity hash, git audit log, 0% hallucination on every
citation via verify_citation), what we don't yet guarantee (no daily
sync, no per-paragraph in-force dates, federal-only), and the
roadmap. Reads like a guarantee document, not marketing.
- gitlaw_mcp/server.py
Two new MCP tools: get_corpus_status() and verify_law_provenance(abbr).
Either tool answers "where does this answer come from" — callable by
any MCP client. Returns source URL + corpus hash + git timestamp +
file path.
- gitlaw_mcp/tests/test_provenance.py
7 new tests pinning the tool contracts. Includes a CI canary that
runs `build_manifest --check` — if a law file changes without the
manifest being regenerated, the test goes red. (Slow because it hashes
5,942 files, but worth it as a guard rail.)
- .github/workflows/freshness-check.yml
Same canary as a GitHub Action — runs on PRs that touch /laws/ or the
manifest. Drift becomes impossible to merge without noticing.
Honest scope:
This PR delivers the *scaffold* — the data structures, MCP tools, tests,
and CI guard rail. The actual daily re-sync from gesetze-im-internet.de
(Phase 1 in TRUST.md) is the next milestone and is multi-day work because
the upstream XML → markdown parser needs to be wired against our existing
normaliser. The scaffold is what makes that next step measurable.
Tests: 130 → 137 (all green).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Two ruff lint cleanups, no behaviour change: - eval/run.py: noqa E402 on imports that must follow sys.path setup - freshness/build_manifest.py: drop unused `os` import Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nternet.de
Builds on the manifest scaffold. Adds the daily HEAD-check that makes upstream
drift *visible* — without yet rewriting our markdown corpus when drift is
detected (Phase 1b).
What this adds:
- gitlaw_mcp/freshness/upstream_sources.json
Registry of 36 monitored laws, each mapping our abbreviation to the
upstream gesetze-im-internet.de slug (which often differs — e.g.
AufenthG → aufenthg_2004 because German law has year-versioned URLs).
- gitlaw_mcp/freshness/sync.py
Reads the registry, HEAD-requests every entry, compares Last-Modified +
ETag to the committed upstream_snapshots.json. On drift: updates the
snapshot and appends a timestamped row to sync_log.md. Modes:
--dry-run (don't write)
--offline (skip network, summarise cache)
- gitlaw_mcp/freshness/upstream_snapshots.json
Committed record of what we last saw upstream. Per-law ETag,
Last-Modified, first_seen and last_checked timestamps.
- gitlaw_mcp/server.py
Two new MCP tools:
check_upstream_currency(abbreviation) — compares our corpus
git-timestamp against upstream Last-Modified, returns drift_status +
days_behind
list_drifted_laws() — every monitored law where upstream is newer
than our corpus, sorted by staleness descending
- .github/workflows/upstream-sync.yml
Daily cron at 05:17 UTC. Runs sync, commits snapshot + log if anything
changed. Touches only freshness/ files — never the corpus itself.
- gitlaw_mcp/tests/test_upstream_sync.py
9 hermetic tests (no network) covering: first-sync baseline, drift
detection, network-failure preservation of prior snapshots, dry-run,
offline mode, both new MCP tools, drift-list sort order.
- gitlaw_mcp/freshness/TRUST.md
Updated to reflect Phase 1a as shipped, Phase 1b as the next milestone
(auto-resync of stale markdown — needs XML → markdown parser).
First live run discovered 6 of 36 laws were already stale upstream, with BGB
being 50 days behind. That's exactly the kind of fact a citizen or lawyer
should know before relying on our markdown. Now they can.
Test count: 137 → 146 (all green).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Polish for the public publish. Restructures the README so a casual visitor sees the killer features above the fold: - Headline + badges updated (146 tests, 0% measured hallucinations, TRUST.md link) - "What you ask Claude" use-case table at the top — concrete, not abstract - "Why this exists" leads with the 5.9% → 0% eval result instead of generic claims - NEW "How do you know it's correct?" section — five questions, five tools, each answer is a one-call demonstration. Plus the embedded live drift status block (6 of 36 monitored laws stale upstream, BGB 50 days behind) - Tools table split into "core six" + "trust four" so the new provenance and freshness tools are surfaced as features, not buried - Cross-link block to safevoice-mcp and grailsense — declares the MCP-server portfolio strategy explicitly - Roadmap updated: ✅ eval / ✅ manifest / ✅ drift detection / Phase 1b next - Contact + community section so visitors know where to ask questions - New .env.example at repo root — copy to .env.local, fill in OPENAI_API_KEY The single most important add is the "How do you know it's correct?" section. That's the differentiator against every other legal-tech tool, AI or otherwise: we don't ask you to trust — we hand you the four tools to verify. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Answers the question that sits under every "verified" claim: how do you know the underlying corpus is correct, current, and the same corpus the next agent will see?
What's in this PR
freshness/build_manifest.py/laws/, hashes every file, produces aggregate hash that changes iff any law changesfreshness/manifest.jsonfreshness/TRUST.mdfreshness/README.mdserver.pyget_corpus_status()+verify_law_provenance(abbr)tests/test_provenance.py.github/workflows/freshness-check.ymlHeadline numbers
aggregate_sha256: b93152a9…b48fdb81— the one-number proof of corpus stateWhat this unlocks for the user
Anyone — citizen, lawyer, or LLM agent — can now answer "where does this answer come from?" by calling two tools or opening one JSON file:
Honest scope
This PR is the scaffold — data structures, MCP tools, tests, CI guard rail.
What's NOT in this PR (intentional, on the roadmap in
TRUST.md):The scaffold is what makes those measurable next milestones.
🤖 Generated with Claude Code