dqt's semantic layer uses Andrej Karpathy's LLM Wiki pattern as its knowledge representation: a two-folder Obsidian vault where raw/ holds atomic source-of-truth documents and wiki/ holds synthesised knowledge produced by the system.
Reference: Build Your Second Brain With Claude Code, Karpathy's Method
vault/
├── raw/ ← source-of-truth (edit these)
│ ├── datasets/
│ │ └── gigler_transactions.md # dataset-level metadata
│ └── columns/
│ └── gigler_transactions/
│ └── amount_usd.md # per-column atomic note
└── wiki/ ← synthesised knowledge (generated)
├── metrics/
│ │ └── weekly_transaction_volume.md
└── lineage/
└── causality.md # discovered causal edges
raw/ = what you own. Dataset descriptions, column semantics, classifications, PII flags, owners. These are the atomic documents you write and version-control.
wiki/ = what dqt derives. Metric definitions, causal relationships, lineage traces, cross-dataset dependencies. Generated by write_vault() or by Claude Code reading your raw documents.
The standard path: author a semantic.yaml, then generate the full vault.
version: "1"
datasets:
- id: gigler_transactions
description: "Transaction records from the Gigler freelance marketplace."
owner: "platform-analytics@gigler.com"
domain: "platform"
freshness_sla_hours: 4
source_files:
- "data/gigler_transactions_2024_q1.csv"
columns:
- name: transaction_id
description: "Unique transaction identifier. Format: TXN-NNNNNN."
classification: internal
pii: false
- name: amount_usd
description: "Gross transaction value in USD. Values >$5,000 are enterprise contracts."
classification: internal
pii: false
unit: USD
- name: status
description: "Transaction lifecycle state."
classification: internal
pii: false
- name: rating
description: "Buyer satisfaction score 1.0–5.0. NULL until completion."
classification: internal
pii: falsefrom dqt.semantic import load_semantic_manifest
from dqt.lineage import LineageGraph, LineageNode, LineageEdge, write_vault
manifest = load_semantic_manifest("semantic.yaml")
# Build graph: declare nodes and edges you know about
graph = LineageGraph(nodes=[], edges=[])
# Add dataset nodes
for ds in manifest.datasets:
graph.nodes.append(LineageNode(id=ds.id, kind="dataset", label=ds.id))
# Add a causal edge: marketing spend → transaction volume
graph.edges.append(LineageEdge(
source="marketing_campaigns.spend_usd",
target="gigler_transactions.amount_usd",
kind="causality",
confidence=0.60,
lag_weeks=2,
description="Acquisition spend drives transaction volume with 2-week lag",
))
write_vault(manifest, graph, vault_dir="vault/", vault_title="Gigler Knowledge Graph")Point Obsidian at the vault/ directory. Every [[wiki-link]] in the generated files is clickable, every column doc shows upstream and downstream lineage.
If your team already has BI tickets, SQL reports, and warehouse docs, you don't need to write semantic.yaml from scratch. Use Claude Code to extract the semantic layer directly from those artifacts.
This mirrors the second brain pattern: dump raw artifacts into raw/, let Claude Code synthesise the semantic knowledge into wiki/.
raw/
├── tickets/ ← Trello cards, Jira tickets, Notion pages
├── sql/ ← SQL queries, dbt models, Python scripts
├── reports/ ← BI report HTMLs, PDFs, screenshots
└── schema/ ← ERDs, warehouse DDL, data catalog exports
Context: Your GTM team creates Trello tickets for every BI report they request. Each ticket has required fields (SQL query, report HTML/PDF, description of what was asked) and links back to the dashboard. These tickets are the richest documentation of what your data means to the business.
Export your BI board from Trello (Board → Show Menu → More → Print and Export → Export as JSON). Or use the Trello API:
curl "https://api.trello.com/1/boards/{BOARD_ID}/cards?attachments=true&key={KEY}&token={TOKEN}" \
> raw/tickets/trello_export.jsonPlace each ticket's attachments into raw/:
raw/
├── tickets/
│ ├── trello_export.json
│ ├── GTM-001-weekly-revenue-by-geo/
│ │ ├── ticket.md # ticket description + fields
│ │ ├── weekly_revenue.sql # the SQL that generates the report
│ │ └── weekly_revenue.html # the report itself
│ ├── GTM-002-vendor-retention-cohorts/
│ │ ├── ticket.md
│ │ ├── vendor_cohorts.py
│ │ └── vendor_cohorts.html
│ └── GTM-003-campaign-roi-by-channel/
│ ├── ticket.md
│ ├── campaign_roi.sql
│ └── campaign_roi.html
├── schema/
│ └── warehouse_ddl.sql # CREATE TABLE statements
└── reports/
└── q1_2024_board_deck.pdf # quarterly review with metric definitions
Format ticket.md files like this (matches Trello's required fields):
# GTM-001: Weekly Revenue by Geography
**Requested by:** Sarah Chen, VP Marketing
**Date:** 2024-03-15
**Status:** Delivered
## What was asked
Weekly breakdown of platform revenue by buyer geography (country + continent).
Alert the GTM team if any region drops >15% week-over-week.
## Data sources
- `gigler_transactions` — amount_usd, buyer_country, date
- `gig_prices` — avg_price_usd (used to normalise by category)
## Business definitions
- **Revenue**: sum of `amount_usd` for `status = 'completed'` transactions
- **Region**: buyer_country mapped to continent via the geo lookup table
- **WoW drop alert**: triggered when current_week / prior_week < 0.85
## Known data issues
- buyer_country is NULL for ~0.3% of pre-2023 rows — excluded from regional totals
- amount_usd >$5,000 are enterprise contracts — reported separatelycd path/to/your/vault
claude .Or open the vault directory in Claude Code's VS Code extension.
Use this prompt (or paste it into Claude Code directly):
Read all files in raw/tickets/, raw/sql/, raw/schema/, and raw/reports/.
Extract a complete semantic layer from these business artifacts:
1. For each dataset referenced across the tickets and SQL files, create a wiki/datasets/<name>.md with:
- Business description (what this table represents)
- Owner and domain (inferred from ticket requestors and descriptions)
- Key columns with business meanings (from how they're used in SQL and reports)
- Known data issues (from the "Known data issues" sections in tickets)
- Freshness expectations (from the alert thresholds mentioned in tickets)
2. For each metric defined in the tickets and reports, create wiki/metrics/<name>.md with:
- Definition (the exact SQL aggregation logic)
- Business owner (who requested it)
- How it's used in decisions (from the ticket context)
- Alert thresholds (from the business rules in tickets)
3. Create wiki/lineage/causality.md listing causal relationships inferred from the SQL joins,
CTEs, and the ticket descriptions of what drives what.
4. Generate semantic.yaml in this directory — a dqt semantic manifest capturing all datasets
and columns you found, with descriptions written for a data engineer audience.
Use [[wiki-link]] syntax to cross-reference documents.
Do not hallucinate column names or metric definitions — only use what appears in the files.
Claude Code will generate wiki/ content and semantic.yaml. Review:
wiki/datasets/— do the descriptions match how the business actually uses these tables?wiki/metrics/— are the SQL definitions exactly right?wiki/lineage/causality.md— are the causal claims defensible?semantic.yaml— check owners, domains, PII flags before committing
Edit any files directly in Obsidian — the [[wiki-links]] update automatically.
Once semantic.yaml looks right, run the dqt vault writer to add per-column atomic notes with lineage:
from dqt.semantic import load_semantic_manifest
from dqt.lineage import LineageGraph, LineageEdge, write_vault
manifest = load_semantic_manifest("semantic.yaml")
# Build LineageGraph from the causal edges Claude Code discovered
# (copy these from wiki/lineage/causality.md)
graph = LineageGraph(nodes=[], edges=[
LineageEdge(
source="marketing_campaigns.spend_usd",
target="gigler_transactions.amount_usd",
kind="causality",
confidence=0.60,
lag_weeks=2,
description="Acquisition spend drives transaction volume (from GTM-001)",
),
# ... add more edges
])
write_vault(manifest, graph, vault_dir="vault/", vault_title="Your Company Knowledge Graph")This adds structured raw/datasets/ and raw/columns/ docs on top of the wiki/ content Claude Code wrote.
Once the vault exists, Claude Code becomes a genuinely grounded assistant for data tasks:
Generate dqt checks from business rules:
Read wiki/metrics/weekly_revenue.md and the thresholds in raw/tickets/GTM-001/.
Write dqt checks in YAML that would have caught the data issues described in the tickets.
Explain an incident:
Read wiki/lineage/causality.md and wiki/datasets/gigler_transactions.md.
The null_fraction on amount_usd spiked to 12% this morning.
Which upstream tables should I check first?
Onboard a new analyst:
Read all files in wiki/datasets/ and wiki/metrics/.
Write a 1-page data dictionary for the platform domain, suitable for a new analyst
who has never seen the warehouse.
Audit PII coverage:
Read all files in raw/columns/.
List every column where pii: true and which tickets reference them.
Are there any columns that sound like PII but aren't flagged?
from dqt.semantic import (
SemanticManifest,
DatasetDescription,
ColumnDescription,
load_semantic_manifest,
)
from dqt.lineage import (
LineageGraph,
LineageNode,
LineageEdge,
write_vault,
)
# Load from YAML
manifest = load_semantic_manifest("semantic.yaml")
# Or build programmatically
manifest = SemanticManifest(
version="1",
datasets=[
DatasetDescription(
id="gigler_transactions",
description="Transaction records from the Gigler marketplace.",
owner="platform-analytics@gigler.com",
domain="platform",
freshness_sla_hours=4,
columns=[
ColumnDescription(
name="amount_usd",
description="Gross transaction value in USD.",
classification="internal",
pii=False,
unit="USD",
),
],
),
],
)
# Access programmatically
ds = manifest.get_dataset("gigler_transactions")
col = manifest.get_column("gigler_transactions", "amount_usd")
# Build lineage graph
graph = LineageGraph(
nodes=[
LineageNode(id="gigler_transactions", kind="dataset", label="Transactions"),
LineageNode(id="weekly_transaction_volume", kind="metric", label="Weekly Volume",
meta={"description": "Sum of completed transactions per week",
"unit": "count", "domain": "platform"}),
],
edges=[
LineageEdge(
source="gigler_transactions.amount_usd",
target="weekly_transaction_volume",
kind="aggregates",
),
],
)
# Write vault
write_vault(manifest, graph, vault_dir="vault/", vault_title="My Knowledge Graph")