Merged
21 changes: 21 additions & 0 deletions .env.example
@@ -0,0 +1,21 @@
# Copy this file to `.env.local` and fill in real values.
# .env.local is gitignored — never commit your API keys.

# ── Required for semantic search (search_laws, hybrid_search) ─────────
# Used to embed queries against the FAISS vectorstore at rag/vectorstore/.
# Get a key at https://platform.openai.com/api-keys
# The four offline tools (verify_citation, lookup_paragraph, list_laws,
# find_related_paragraphs) work WITHOUT a key — start there if you don't
# want to pay for embeddings.
OPENAI_API_KEY=sk-...

# ── Optional: hosted SSE deployment (Fly.io / Railway / Cloud Run) ────
# Only relevant if you're running the server in SSE mode for hosted agents.
# Leave unset for stdio mode (the default for Claude Desktop / Cursor).
# MCP_TRANSPORT=sse
# PORT=8000
# FASTMCP_HOST=0.0.0.0

# ── Optional: alternative LLM providers (only used by the eval harness) ─
# The eval/run.py harness uses gpt-4o-mini by default. Set to override.
# EVAL_MODEL=gpt-4o-mini
47 changes: 47 additions & 0 deletions .github/workflows/freshness-check.yml
@@ -0,0 +1,47 @@
# Freshness guard rail — runs on every PR.
#
# If a law file in /laws/ is edited without regenerating manifest.json, this
# workflow fails. The check is fast (~5s on 5,942 files) and zero-config.
#
# A separate daily-sync workflow (planned, freshness/TRUST.md § "Phase 1")
# will run `freshness.sync` to actually re-fetch from gesetze-im-internet.de.
# That one needs the upstream parser wired up — this one is the simple guard
# that catches local drift today.

name: Freshness Check

on:
  pull_request:
    paths:
      - "laws/**"
      - "gitlaw_mcp/freshness/manifest.json"
      - "gitlaw_mcp/freshness/build_manifest.py"
  push:
    branches: [main]
    paths:
      - "laws/**"
      - "gitlaw_mcp/freshness/manifest.json"

jobs:
  manifest-matches-corpus:
    name: Manifest matches corpus
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip

      - name: Verify manifest.json matches /laws/ contents
        run: python -m gitlaw_mcp.freshness.build_manifest --check

      - name: Report on drift
        if: failure()
        run: |
          echo "::error::Corpus drift detected. A law file changed without"
          echo "::error::regenerating manifest.json. To fix locally:"
          echo "::error:: python -m gitlaw_mcp.freshness.build_manifest"
          echo "::error::Then commit the updated gitlaw_mcp/freshness/manifest.json."
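Reviewer note: the `--check` step above can be pictured roughly as follows. This is a minimal sketch, not the contents of `build_manifest.py`. The manifest layout (a `files` mapping plus an `aggregate_sha256` field) and the helper names `build_manifest` and `check_manifest` are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def build_manifest(laws_dir: Path) -> dict:
    """Hash every file under laws_dir and derive one aggregate digest.

    Hypothetical layout: {"files": {relative_path: sha256}, "aggregate_sha256": str}.
    """
    files = {}
    for path in sorted(p for p in laws_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(laws_dir).as_posix()
        files[rel] = hashlib.sha256(path.read_bytes()).hexdigest()

    # Feeding sorted "path:digest" lines into one hash keeps the aggregate
    # independent of filesystem traversal order.
    aggregate = hashlib.sha256()
    for rel in sorted(files):
        aggregate.update(f"{rel}:{files[rel]}\n".encode("utf-8"))
    return {"files": files, "aggregate_sha256": aggregate.hexdigest()}


def check_manifest(laws_dir: Path, committed: dict) -> list[str]:
    """Return the relative paths whose hash differs from the committed manifest."""
    current = build_manifest(laws_dir)["files"]
    old = committed.get("files", {})
    all_paths = current.keys() | old.keys()
    return sorted(rel for rel in all_paths if current.get(rel) != old.get(rel))
```

A CI wrapper around this sketch would exit non-zero whenever `check_manifest` returns a non-empty list, which is what lets the "Report on drift" step fire.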
51 changes: 51 additions & 0 deletions .github/workflows/upstream-sync.yml
@@ -0,0 +1,51 @@
# Daily upstream sync — HEAD-checks gesetze-im-internet.de for drift in any
# monitored law, then commits the updated snapshot file + sync log if anything
# changed.
#
# This is what makes "is our corpus current?" answerable continuously, not just
# at the moment a developer notices. Runs unattended on schedule; opens no PRs
# (it commits directly to main because the only files touched are the
# snapshot record and the human-readable log — never the corpus itself).
#
# Phase 2 of the freshness roadmap (rewriting the markdown corpus from the
# fresh XML) is a separate workflow.

name: Upstream Sync

on:
  schedule:
    - cron: "17 5 * * *" # 05:17 UTC daily — early-morning Europe, off the hour
  workflow_dispatch: {}

jobs:
  sync:
    name: Check upstream gesetze-im-internet.de for drift
    runs-on: ubuntu-latest
    timeout-minutes: 10
    permissions:
      contents: write # the job commits state changes back to main

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Run sync
        id: sync
        run: |
          python -m gitlaw_mcp.freshness.sync
          echo "ran=true" >> "$GITHUB_OUTPUT"

      - name: Commit snapshot + log changes (if any)
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          if git diff --quiet gitlaw_mcp/freshness/upstream_snapshots.json gitlaw_mcp/freshness/sync_log.md; then
            echo "No upstream changes detected; nothing to commit."
            exit 0
          fi
          git add gitlaw_mcp/freshness/upstream_snapshots.json gitlaw_mcp/freshness/sync_log.md
          git commit -m "chore(freshness): daily upstream sync — $(date -u +%Y-%m-%d)"
          git push
114 changes: 97 additions & 17 deletions gitlaw_mcp/README.md
@@ -1,25 +1,64 @@
# GitLaw MCP Server

[![MCP CI](https://github.com/mikelninh/gitlaw/actions/workflows/mcp-ci.yml/badge.svg)](https://github.com/mikelninh/gitlaw/actions/workflows/mcp-ci.yml)
[![Eval: 118/118](https://img.shields.io/badge/eval-118%2F118_(100%25)-brightgreen?logo=pytest)](gitlaw_mcp/tests/cases.json)
[![Transport: stdio + SSE](https://img.shields.io/badge/transport-stdio_%2B_SSE-blue)](#hosted-deployment-flyio-frankfurt)
[![Tests](https://img.shields.io/badge/tests-146%2F146-brightgreen?logo=pytest)](gitlaw_mcp/tests/)
[![Hallucination rate](https://img.shields.io/badge/measured_hallucinations-0%25-brightgreen)](gitlaw_mcp/eval/eval_summary.md)
[![Trust statement](https://img.shields.io/badge/trust-TRUST.md-blue)](gitlaw_mcp/freshness/TRUST.md)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](../LICENSE)

> **MCP server that exposes 5,936 German laws + RAG search + citation verification as tools any LLM client can call.**
> **Model Context Protocol server for German federal law — 5,942 statutes indexed, anti-hallucination citation verification, daily drift detection against the official source. Built for legal agents that need to ground every § they cite.**

Built on top of the existing GitLaw RAG pipeline (FAISS vectorstore, OpenAI embeddings, paragraph-level chunking of all federal German laws).
10 tools, one resource, one trust contract:

| You ask Claude / Cursor… | …with GitLaw MCP it answers |
|---|---|
| "Verify § 573 BGB" | Returns the real paragraph text. Or `verified: false` with a structured reason. **Never invents.** |
| "Mein Vermieter kündigt wegen Eigenbedarf — was kann ich tun?" | Semantic search finds § 574 BGB, returns the text, the LLM grounds its answer in real statute |
| "How do you know your BGB is current?" | `check_upstream_currency("BGB")` — returns the days_behind vs. gesetze-im-internet.de live |
| "What's the integrity hash of your corpus right now?" | `get_corpus_status()` — single SHA-256 every consumer can verify |

---

## Why this exists

LLMs hallucinate German law all the time. They confidently cite `§ 999 StGB` (doesn't exist), invent paragraph titles, or swap statutes. This server gives any MCP-compatible client (Claude Desktop, Cursor, Continue, custom agents) a set of **verifiable** legal tools:
LLMs hallucinate German law all the time. They confidently cite `§ 999 StGB` (doesn't exist), invent paragraph titles, swap statutes. We measured `gpt-4o-mini` on 25 real Lebenslagen questions: **5.9% of its cited paragraphs were fake.** That's catastrophic for a lawyer, harmful for a citizen, dishonest for AI.

With GitLaw MCP available as a tool, the measured hallucination rate drops to **0%**: the model has no reason to invent when `verify_citation` is one call away. See [`eval/eval_summary.md`](gitlaw_mcp/eval/eval_summary.md) for the reproducible report.

This server gives any MCP-compatible client (Claude Desktop, Cursor, Continue, custom agents) the legal-tools surface it needs:

- **Semantic search** across all 5,936 laws → grounded retrieval
- **Citation verification** → returns *the actual paragraph text* if the citation exists, or `verified: false` with a reason if not
- **Semantic search** across all 5,942 federal statutes → grounded retrieval
- **Citation verification** → real paragraph text or structured rejection — **no hallucinated §**
- **Exact lookup** by abbreviation + paragraph
- **Law enumeration** for discovery
- **Citation-graph traversal** (94k nodes, 200k edges) — who cites whom
- **Corpus provenance** — every served paragraph has a public source URL and SHA-256
- **Live drift detection** — daily HEAD-check against gesetze-im-internet.de, surfaces stale law data

---

## How do you know it's correct? *(read this before building on top of us)*

This is the most important section of the README. Trust isn't a vibe — it's evidence.

| Question | Where to look |
|---|---|
| Is every cited § actually in the corpus? | `verify_citation()` returns `verified: false` if not. **0% hallucination measured.** |
| Where does each law come from? | `verify_law_provenance(abbr)` → official source URL + SHA-256 + git timestamp |
| Is the corpus the same one another agent is seeing? | `get_corpus_status()` → single aggregate SHA-256, deterministic, public |
| Has anything changed upstream since we synced? | `check_upstream_currency(abbr)` → days behind upstream + last-modified timestamps |
| What's your full promise / disclosure of gaps? | **Read [`freshness/TRUST.md`](gitlaw_mcp/freshness/TRUST.md) — it's the most honest legal-tech trust document you'll read this year.** |

**Live drift status** (the integrity check is automated; this section reflects the latest sync):

```
6 of 36 monitored laws are stale vs. upstream gesetze-im-internet.de
BGB: 50 days behind ZPO: 49 days behind SGG: 49 days behind
GG: 29 days behind HGB: 29 days behind AO: 21 days behind
```

We tell you this *on purpose*. A citizen looking up tenant rights should know if our § 573 BGB is older than the official version. Daily cron (`upstream-sync.yml`) refreshes it automatically.
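
For readers wondering how a "days behind" figure like the ones above could be derived, here is a minimal sketch. It assumes the check compares the upstream `Last-Modified` header with a timezone-aware corpus timestamp. The function name `classify_drift` and the `"current"` label are hypothetical, and the real `freshness/sync.py` may work differently. The field names mirror the `days_behind` value mentioned for `check_upstream_currency`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def classify_drift(last_modified_header: str, corpus_timestamp: datetime) -> dict:
    """Compare an upstream Last-Modified header with our corpus timestamp.

    corpus_timestamp must be timezone-aware (e.g. a git commit time in UTC).
    """
    upstream = parsedate_to_datetime(last_modified_header)
    if upstream.tzinfo is None:
        # Headers with a "-0000" zone parse as naive datetimes; treat them as UTC.
        upstream = upstream.replace(tzinfo=timezone.utc)

    delta = upstream - corpus_timestamp
    if delta.total_seconds() <= 0:
        # Assumed label: upstream is not newer than what we serve.
        return {"drift_status": "current", "days_behind": 0}
    # Whole days only; a gap under 24 hours reports 0 days but is still stale.
    return {"drift_status": "stale", "days_behind": delta.days}


if __name__ == "__main__":
    corpus = datetime(2023, 11, 12, tzinfo=timezone.utc)
    print(classify_drift("Mon, 01 Jan 2024 00:00:00 GMT", corpus))
```

Only the HTTP header is needed for this comparison, which is why a HEAD request is enough and the full law text never has to be downloaded during the daily check.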

The result: an LLM connected to this server can ground every legal claim in the real German Bundesrecht corpus, with a structured "I checked" / "I couldn't verify" signal on every citation.
---

---

@@ -74,13 +113,25 @@ Known limitations (honest):

## Tools exposed

### Retrieval & verification (the core six)

| Tool | Purpose | Example |
|---|---|---|
| `search_laws(query, limit=5)` | Semantic search across all paragraphs (FAISS, OpenAI embeddings) | `"Beleidigung im Internet"` |
| `verify_citation(citation)` | Parse `§ 185 StGB` style strings, return actual text or `verified: false` with reason | `"§ 185 Abs. 1 StGB"` |
| `lookup_paragraph(abbr, paragraph)` | Exact lookup with structured input | `("StGB", "263a")` |
| `list_laws(filter=None, limit=50)` | Enumerate available laws (4,852+ unique abbreviations indexed) | `filter="bgb"` |
| `find_related_paragraphs(citation)` | Walk the citation graph (94K paragraphs, 200K refs) — returns who cites X *and* what X cites | `"§ 185 StGB"` |
| `search_laws(query, limit=5)` | Semantic search across all paragraphs (FAISS + OpenAI embeddings) | `"Beleidigung im Internet"` |
| `verify_citation(citation)` | Parse `§ 185 StGB` style strings → real text or structured rejection. **The anti-hallucination tool.** | `"§ 185 Abs. 1 StGB"` |
| `lookup_paragraph(abbr, paragraph)` | Exact lookup when you have structured input | `("StGB", "263a")` |
| `list_laws(filter=None, limit=50)` | Enumerate available laws (5,942 indexed) | `filter="bgb"` |
| `find_related_paragraphs(citation)` | Walk the citation graph (94k nodes, 200k edges) — who cites X, what X cites | `"§ 185 StGB"` |
| `hybrid_search(query, limit, expand)` | Semantic + 1-hop graph expansion in one call | `"Eigenbedarf", expand=2` |

### Provenance & freshness (the four trust tools)

| Tool | Purpose | Example output |
|---|---|---|
| `get_corpus_status()` | Single integrity hash + law count + when manifest was last built | `aggregate_sha256: b93152a9…` |
| `verify_law_provenance(abbr)` | Source URL + SHA-256 + git timestamp for one law | source_url, corpus_sha256, corpus_bytes |
| `check_upstream_currency(abbr)` | Compares our git timestamp vs. gesetze-im-internet.de Last-Modified | `drift_status: "stale", days_behind: 50` |
| `list_drifted_laws()` | Every monitored law where upstream is newer than our corpus, sorted by staleness | sorted list of drifted laws |

Plus the resource `gitlaw://law/{abbreviation}` returning the full markdown content of a law.
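
To make the "real text or structured rejection" contract concrete, here is a minimal sketch of how a citation string might be parsed and checked. It is not the server's implementation: the regex, the in-memory `corpus` dict, the function name `verify_citation_sketch`, and the `reason` strings are all assumptions for illustration. It handles only the `§ N [Abs. M] ABBR` shape; nested `Nr.` parts are not covered.

```python
import re

# Hypothetical pattern for citations such as "§ 185 StGB" or "§ 185 Abs. 1 StGB".
CITATION_RE = re.compile(
    r"§\s*(?P<paragraph>\d+[a-z]?)"           # paragraph number, e.g. 185 or 263a
    r"(?:\s+Abs\.\s*(?P<absatz>\d+))?"        # optional Absatz
    r"\s+(?P<abbr>[A-ZÄÖÜ][A-Za-zÄÖÜäöü]*)"   # law abbreviation, e.g. StGB
)


def verify_citation_sketch(citation: str, corpus: dict) -> dict:
    """Return the stored text for a citation, or a structured rejection.

    corpus is assumed to map abbreviation -> {paragraph number -> text}.
    """
    match = CITATION_RE.fullmatch(citation.strip())
    if match is None:
        return {"verified": False, "reason": "unparseable_citation"}

    abbr, paragraph = match["abbr"], match["paragraph"]
    law = corpus.get(abbr)
    if law is None:
        return {"verified": False, "reason": "unknown_law"}

    text = law.get(paragraph)
    if text is None:
        return {"verified": False, "reason": "paragraph_not_found"}
    return {"verified": True, "abbr": abbr, "paragraph": paragraph, "text": text}


if __name__ == "__main__":
    demo_corpus = {"StGB": {"185": "(paragraph text placeholder)"}}
    print(verify_citation_sketch("§ 185 Abs. 1 StGB", demo_corpus))
    print(verify_citation_sketch("§ 999 StGB", demo_corpus))
```

The point of the design is that every failure path returns data rather than raising, so a calling model always receives an explicit `verified: false` instead of a gap it might fill with invented text.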

@@ -307,13 +358,42 @@ default `gitlaw_mcp/Dockerfile` stays in stdio mode for Claude Desktop.
## Roadmap

- [x] ~~HTTP/SSE transport~~ — done (Dockerfile.fly + fly.toml + SSE in server.py)
- [x] ~~Citation graph + `find_related_paragraphs` tool~~ — done (94K nodes, 200K edges)
- [ ] Eval harness: 50+ hand-labelled citation-verification cases, run in CI
- [x] ~~Citation graph + `find_related_paragraphs` tool~~ — done (94k nodes, 200k edges)
- [x] ~~Eval harness with reproducible hallucination measurement~~ — done (`eval/`, 25 questions)
- [x] ~~Corpus provenance manifest~~ — done (`freshness/manifest.json`, per-law SHA-256)
- [x] ~~Live drift detection vs. gesetze-im-internet.de~~ — done (`freshness/sync.py`, daily cron)
- [ ] **Phase 1b** — auto-resync stale markdown when drift detected (needs XML→markdown parser, ~2 weekends)
- [ ] Nested `§ X Abs. Y Nr. Z` citation parsing
- [ ] Schweizer / Österreichischer Rechtskorpus (already partially in `laws_*.py`)
- [ ] Landesrecht (state-level law)
- [ ] Per-tenant rate limiting (relevant once multi-tenant SSE clients exist)

---

## Part of an MCP-server portfolio

GitLaw MCP is one of three Model Context Protocol servers built as
**a thin agent-readable layer over real-world workflows**. The pattern is
deliberately reproducible — same architecture, different domains:

- **[gitlaw-mcp](https://github.com/mikelninh/gitlaw)** — German federal law (you're here)
- **[safevoice-mcp](https://github.com/mikelninh/safevoice)** — victim-of-digital-harassment tooling: classification, applicable §, Strafantrag-Fristen, jurisdiction, anonymisation (DE/AT/CH/UK)
- **[grailsense](https://github.com/mikelninh/grailsense)** — NFT collector intelligence over Blockscout: archetype classification + shareable soul cards

Together they're an early sketch of what **public-good civic infrastructure**
looks like in the LLM era: open source, MIT, verifiable, composable.

---

## Contact + community

- **Issues / bug reports** — [GitHub Issues](https://github.com/mikelninh/gitlaw/issues)
- **Strategic discussion** — [GitHub Discussions](https://github.com/mikelninh/gitlaw/discussions)
- **Direct** — open an issue tagged `question` if it's broader than a bug
- **Built by** [@mikelninh](https://github.com/mikelninh) — Berlin

---

## License

MIT. Part of the [GitLaw](../README.md) project — open infrastructure for digital legal services in Germany.
MIT. Part of the [GitLaw](../README.md) project — open infrastructure for digital legal services in Germany. The underlying corpus of German federal law is public domain per § 5 UrhG.
3 changes: 3 additions & 0 deletions gitlaw_mcp/eval/.gitignore
@@ -0,0 +1,3 @@
# Per-run output reports — eval_summary.md is committed as the latest snapshot,
# but the timestamped per-run JSON dumps are not (they grow without bound).
eval_report_*.json
81 changes: 81 additions & 0 deletions gitlaw_mcp/eval/README.md
@@ -0,0 +1,81 @@
# GitLaw MCP — outcome eval

This directory is the **public, reproducible eval harness** for GitLaw MCP. It
measures the answer-quality difference between an LLM answering legal questions
*without* tools versus *with* the GitLaw MCP tools available.

The whole point: claims about anti-hallucination need data, not vibes. This is
the data.

---

## How to read the headline number

A run produces two metrics that matter:

- **Hallucination rate** — fraction of cited paragraphs that don't exist in the
German Bundesrecht corpus. Lower is better. The MCP is designed to drive this
to zero, because every cited § goes through `verify_citation` before the model
uses it.
- **Expected-hit rate** — fraction of questions where the answer cited at least
one of the paragraphs a competent legal answer would mention. Higher is better.

A useful third number: **citations per answer**. Treatment is usually lower than
baseline because the model becomes more conservative (only cites what it
verified). That's by design — but worth watching, because over-conservatism can
cost hit-rate.
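
The three numbers can be computed from per-question records along these lines. This is a sketch under assumptions: the record layout (`cited` and `expected` lists) and the function name `score_run` are hypothetical and need not match what `run.py` writes to its JSON report.

```python
def score_run(results: list[dict], corpus: set[str]) -> dict:
    """Compute the headline metrics for one condition (baseline or treatment).

    results: one dict per question with "cited" and "expected" citation lists.
    corpus:  the set of citations that actually exist.
    """
    total_cited = sum(len(r["cited"]) for r in results)
    fake = sum(1 for r in results for c in r["cited"] if c not in corpus)
    # Binary per question: a hit means at least one expected paragraph was cited.
    hits = sum(1 for r in results if set(r["cited"]) & set(r["expected"]))
    n = len(results)
    return {
        "hallucination_rate": fake / total_cited if total_cited else 0.0,
        "expected_hit_rate": hits / n if n else 0.0,
        "citations_per_answer": total_cited / n if n else 0.0,
    }


if __name__ == "__main__":
    corpus = {"§ 573 BGB", "§ 574 BGB", "§ 185 StGB"}
    results = [
        {"cited": ["§ 573 BGB", "§ 574 BGB"], "expected": ["§ 574 BGB"]},
        {"cited": ["§ 185 StGB", "§ 999 StGB"], "expected": ["§ 186 StGB"]},
    ]
    print(score_run(results, corpus))
```

Note that the hallucination rate is taken over citations while the hit rate is taken over questions, so a model that cites fewer paragraphs can lower the first number and still lose on the second.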

## Run it yourself

```bash
cd /path/to/gitlaw
source .env.local # OPENAI_API_KEY
python -m gitlaw_mcp.eval.run --limit 5 # cheap smoke (~30s, ~$0.005)
python -m gitlaw_mcp.eval.run # full 25 questions (~2 min, ~$0.05)
python -m gitlaw_mcp.eval.run --model gpt-4o # bigger model
```

Output:
- `eval_report_<utc-timestamp>.json` — full per-question detail (input, both
answers, every citation, verification result for each)
- `eval_summary.md` — the markdown summary that gets committed to the repo

## Question set (`questions.json`)

25 hand-labelled questions across Miete, Arbeit, Strafrecht, Erbrecht,
Familie, Grundgesetz, Zivilrecht, Datenschutz. Each comes with
`expected_paragraphs` — the canonical citation(s) we hand-verified against
gesetze-im-internet.de.

The set is intentionally biased toward **realistic Lebenslagen** a citizen,
tenant, employee, or harassment victim would actually ask — not law-school
exam questions. Adding harder long-tail questions (less-common statutes
where the baseline model is more likely to invent) is on the roadmap; those
will widen the gap further.

## Latest run (committed)

See [`eval_summary.md`](./eval_summary.md). It's regenerated on every run and
the most recent committed version is the public record. Past runs sit in git
history.

## What the eval cannot show

Honest limits:

- **One language only (German).** A multilingual eval would need a multilingual
question set + ground truth in each language.
- **One model class.** We test `gpt-4o-mini` by default. The gap widens with
weaker models (e.g. `gpt-3.5-turbo`) and narrows with stronger ones
(`gpt-4o`, Claude Opus). The `--model` flag lets you check.
- **Hit-rate is binary per question.** We don't yet score "partial hit"
(cited a related but adjacent §).
- **Citation extraction is regex-based.** Models sometimes phrase citations
in ways our regex misses — that under-counts citations for both conditions
equally, but distorts absolute hit-rate downward.

These are known. Patches welcome.
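
To illustrate the regex limitation, here is a small sketch of citation extraction. The pattern and the function name `extract_citations` are assumptions, not the harness's actual regex. It recognises only `§ N ABBR` and `§ N Abs. M ABBR` shapes.

```python
import re

# Hypothetical pattern: "§ 573 BGB" or "§ 574 Abs. 1 BGB" style citations only.
CITATION_PATTERN = re.compile(
    r"§\s*(\d+[a-z]?)(?:\s+Abs\.\s*\d+)?\s+([A-ZÄÖÜ][A-Za-zÄÖÜäöü]+)"
)


def extract_citations(answer: str) -> list[str]:
    """Return normalised '§ N ABBR' strings in order of appearance, de-duplicated."""
    seen: list[str] = []
    for paragraph, abbr in CITATION_PATTERN.findall(answer):
        citation = f"§ {paragraph} {abbr}"
        if citation not in seen:
            seen.append(citation)
    return seen


if __name__ == "__main__":
    text = (
        "Nach § 573 BGB und § 574 Abs. 1 BGB kann widersprochen werden; "
        "siehe auch Paragraph 185 des Strafgesetzbuchs."
    )
    print(extract_citations(text))
```

The spelled-out phrase "Paragraph 185 des Strafgesetzbuchs" is not picked up by this pattern, which is the kind of miss that under-counts citations in both conditions.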

## License

Same as the rest of GitLaw MCP — MIT.
1 change: 1 addition & 0 deletions gitlaw_mcp/eval/__init__.py
@@ -0,0 +1 @@
"""Outcome eval harness — does GitLaw MCP measurably reduce hallucinations?"""