Skip to content

Latest commit

 

History

History
362 lines (284 loc) · 12.4 KB

File metadata and controls

362 lines (284 loc) · 12.4 KB

Semantic Layer & LLM Wiki

dqt's semantic layer uses Andrej Karpathy's LLM Wiki pattern as its knowledge representation: a two-folder Obsidian vault where raw/ holds atomic source-of-truth documents and wiki/ holds synthesised knowledge produced by the system.

Reference: Build Your Second Brain With Claude Code, Karpathy's Method


Vault structure

vault/
├── raw/                          ← source-of-truth (edit these)
│   ├── datasets/
│   │   └── gigler_transactions.md   # dataset-level metadata
│   └── columns/
│       └── gigler_transactions/
│           └── amount_usd.md        # per-column atomic note
└── wiki/                         ← synthesised knowledge (generated)
    ├── metrics/
│   │   └── weekly_transaction_volume.md
    └── lineage/
        └── causality.md             # discovered causal edges

raw/ = what you own. Dataset descriptions, column semantics, classifications, PII flags, owners. These are the atomic documents you write and version-control.

wiki/ = what dqt derives. Metric definitions, causal relationships, lineage traces, cross-dataset dependencies. Generated by write_vault() or by Claude Code reading your raw documents.


Generating the vault from YAML

The standard path: author a semantic.yaml, then generate the full vault.

1. Write semantic.yaml

version: "1"
datasets:
  - id: gigler_transactions
    description: "Transaction records from the Gigler freelance marketplace."
    owner: "platform-analytics@gigler.com"
    domain: "platform"
    freshness_sla_hours: 4
    source_files:
      - "data/gigler_transactions_2024_q1.csv"
    columns:
      - name: transaction_id
        description: "Unique transaction identifier. Format: TXN-NNNNNN."
        classification: internal
        pii: false
      - name: amount_usd
        description: "Gross transaction value in USD. Values >$5,000 are enterprise contracts."
        classification: internal
        pii: false
        unit: USD
      - name: status
        description: "Transaction lifecycle state."
        classification: internal
        pii: false
      - name: rating
        description: "Buyer satisfaction score 1.0–5.0. NULL until completion."
        classification: internal
        pii: false

2. Build the lineage graph and write the vault

from dqt.semantic import load_semantic_manifest
from dqt.lineage import LineageGraph, LineageNode, LineageEdge, write_vault

manifest = load_semantic_manifest("semantic.yaml")

# Build graph: declare nodes and edges you know about
graph = LineageGraph(nodes=[], edges=[])

# Add dataset nodes
for ds in manifest.datasets:
    graph.nodes.append(LineageNode(id=ds.id, kind="dataset", label=ds.id))

# Add a causal edge: marketing spend → transaction volume
graph.edges.append(LineageEdge(
    source="marketing_campaigns.spend_usd",
    target="gigler_transactions.amount_usd",
    kind="causality",
    confidence=0.60,
    lag_weeks=2,
    description="Acquisition spend drives transaction volume with 2-week lag",
))

write_vault(manifest, graph, vault_dir="vault/", vault_title="Gigler Knowledge Graph")

3. Open in Obsidian

Point Obsidian at the vault/ directory. Every [[wiki-link]] in the generated files is clickable, every column doc shows upstream and downstream lineage.


Building the semantic layer from existing business artifacts

If your team already has BI tickets, SQL reports, and warehouse docs, you don't need to write semantic.yaml from scratch. Use Claude Code to extract the semantic layer directly from those artifacts.

The workflow

This mirrors the second brain pattern: dump raw artifacts into raw/, let Claude Code synthesise the semantic knowledge into wiki/.

raw/
├── tickets/          ← Trello cards, Jira tickets, Notion pages
├── sql/              ← SQL queries, dbt models, Python scripts
├── reports/          ← BI report HTMLs, PDFs, screenshots
└── schema/           ← ERDs, warehouse DDL, data catalog exports

Step-by-step: Trello + BI reports → semantic layer

Context: Your GTM team creates Trello tickets for every BI report they request. Each ticket has required fields (SQL query, report HTML/PDF, description of what was asked) and links back to the dashboard. These tickets are the richest documentation of what your data means to the business.

Step 1 — Export Trello tickets

Export your BI board from Trello (Board → Show Menu → More → Print and Export → Export as JSON). Or use the Trello API:

curl "https://api.trello.com/1/boards/{BOARD_ID}/cards?attachments=true&key={KEY}&token={TOKEN}" \
  > raw/tickets/trello_export.json

Place each ticket's attachments into raw/:

raw/
├── tickets/
│   ├── trello_export.json
│   ├── GTM-001-weekly-revenue-by-geo/
│   │   ├── ticket.md              # ticket description + fields
│   │   ├── weekly_revenue.sql     # the SQL that generates the report
│   │   └── weekly_revenue.html    # the report itself
│   ├── GTM-002-vendor-retention-cohorts/
│   │   ├── ticket.md
│   │   ├── vendor_cohorts.py
│   │   └── vendor_cohorts.html
│   └── GTM-003-campaign-roi-by-channel/
│       ├── ticket.md
│       ├── campaign_roi.sql
│       └── campaign_roi.html
├── schema/
│   └── warehouse_ddl.sql          # CREATE TABLE statements
└── reports/
    └── q1_2024_board_deck.pdf     # quarterly review with metric definitions

Format ticket.md files like this (matches Trello's required fields):

# GTM-001: Weekly Revenue by Geography

**Requested by:** Sarah Chen, VP Marketing
**Date:** 2024-03-15
**Status:** Delivered

## What was asked
Weekly breakdown of platform revenue by buyer geography (country + continent).
Alert the GTM team if any region drops >15% week-over-week.

## Data sources
- `gigler_transactions` — amount_usd, buyer_country, date
- `gig_prices` — avg_price_usd (used to normalise by category)

## Business definitions
- **Revenue**: sum of `amount_usd` for `status = 'completed'` transactions
- **Region**: buyer_country mapped to continent via the geo lookup table
- **WoW drop alert**: triggered when current_week / prior_week < 0.85

## Known data issues
- buyer_country is NULL for ~0.3% of pre-2023 rows — excluded from regional totals
- amount_usd >$5,000 are enterprise contracts — reported separately

Step 2 — Point Claude Code at the vault

cd path/to/your/vault
claude .

Or open the vault directory in Claude Code's VS Code extension.

Step 3 — Prompt Claude Code to synthesise the semantic layer

Use this prompt (or paste it into Claude Code directly):

Read all files in raw/tickets/, raw/sql/, raw/schema/, and raw/reports/.

Extract a complete semantic layer from these business artifacts:

1. For each dataset referenced across the tickets and SQL files, create a wiki/datasets/<name>.md with:
   - Business description (what this table represents)
   - Owner and domain (inferred from ticket requestors and descriptions)
   - Key columns with business meanings (from how they're used in SQL and reports)
   - Known data issues (from the "Known data issues" sections in tickets)
   - Freshness expectations (from the alert thresholds mentioned in tickets)

2. For each metric defined in the tickets and reports, create wiki/metrics/<name>.md with:
   - Definition (the exact SQL aggregation logic)
   - Business owner (who requested it)
   - How it's used in decisions (from the ticket context)
   - Alert thresholds (from the business rules in tickets)

3. Create wiki/lineage/causality.md listing causal relationships inferred from the SQL joins,
   CTEs, and the ticket descriptions of what drives what.

4. Generate semantic.yaml in this directory — a dqt semantic manifest capturing all datasets
   and columns you found, with descriptions written for a data engineer audience.

Use [[wiki-link]] syntax to cross-reference documents.
Do not hallucinate column names or metric definitions — only use what appears in the files.

Step 4 — Review and refine

Claude Code will generate wiki/ content and semantic.yaml. Review:

  • wiki/datasets/ — do the descriptions match how the business actually uses these tables?
  • wiki/metrics/ — are the SQL definitions exactly right?
  • wiki/lineage/causality.md — are the causal claims defensible?
  • semantic.yaml — check owners, domains, PII flags before committing

Edit any files directly in Obsidian — the [[wiki-links]] update automatically.

Step 5 — Generate the dqt vault

Once semantic.yaml looks right, run the dqt vault writer to add per-column atomic notes with lineage:

from dqt.semantic import load_semantic_manifest
from dqt.lineage import LineageGraph, LineageEdge, write_vault

manifest = load_semantic_manifest("semantic.yaml")

# Build LineageGraph from the causal edges Claude Code discovered
# (copy these from wiki/lineage/causality.md)
graph = LineageGraph(nodes=[], edges=[
    LineageEdge(
        source="marketing_campaigns.spend_usd",
        target="gigler_transactions.amount_usd",
        kind="causality",
        confidence=0.60,
        lag_weeks=2,
        description="Acquisition spend drives transaction volume (from GTM-001)",
    ),
    # ... add more edges
])

write_vault(manifest, graph, vault_dir="vault/", vault_title="Your Company Knowledge Graph")

This adds structured raw/datasets/ and raw/columns/ docs on top of the wiki/ content Claude Code wrote.


Using the completed vault for ongoing data work

Once the vault exists, Claude Code becomes a genuinely grounded assistant for data tasks:

Generate dqt checks from business rules:

Read wiki/metrics/weekly_revenue.md and the thresholds in raw/tickets/GTM-001/.
Write dqt checks in YAML that would have caught the data issues described in the tickets.

Explain an incident:

Read wiki/lineage/causality.md and wiki/datasets/gigler_transactions.md.
The null_fraction on amount_usd spiked to 12% this morning.
Which upstream tables should I check first?

Onboard a new analyst:

Read all files in wiki/datasets/ and wiki/metrics/.
Write a 1-page data dictionary for the platform domain, suitable for a new analyst
who has never seen the warehouse.

Audit PII coverage:

Read all files in raw/columns/.
List every column where pii: true and which tickets reference them.
Are there any columns that sound like PII but aren't flagged?

Python API reference

from dqt.semantic import (
    SemanticManifest,
    DatasetDescription,
    ColumnDescription,
    load_semantic_manifest,
)
from dqt.lineage import (
    LineageGraph,
    LineageNode,
    LineageEdge,
    write_vault,
)

# Load from YAML
manifest = load_semantic_manifest("semantic.yaml")

# Or build programmatically
manifest = SemanticManifest(
    version="1",
    datasets=[
        DatasetDescription(
            id="gigler_transactions",
            description="Transaction records from the Gigler marketplace.",
            owner="platform-analytics@gigler.com",
            domain="platform",
            freshness_sla_hours=4,
            columns=[
                ColumnDescription(
                    name="amount_usd",
                    description="Gross transaction value in USD.",
                    classification="internal",
                    pii=False,
                    unit="USD",
                ),
            ],
        ),
    ],
)

# Access programmatically
ds = manifest.get_dataset("gigler_transactions")
col = manifest.get_column("gigler_transactions", "amount_usd")

# Build lineage graph
graph = LineageGraph(
    nodes=[
        LineageNode(id="gigler_transactions", kind="dataset", label="Transactions"),
        LineageNode(id="weekly_transaction_volume", kind="metric", label="Weekly Volume",
                    meta={"description": "Sum of completed transactions per week",
                          "unit": "count", "domain": "platform"}),
    ],
    edges=[
        LineageEdge(
            source="gigler_transactions.amount_usd",
            target="weekly_transaction_volume",
            kind="aggregates",
        ),
    ],
)

# Write vault
write_vault(manifest, graph, vault_dir="vault/", vault_title="My Knowledge Graph")