207 changes: 206 additions & 1 deletion .opencode/skills/dbt-develop/SKILL.md
@@ -1,6 +1,30 @@
---
name: dbt-develop
description: Create and modify dbt models — staging, intermediate, marts, incremental, medallion architecture. Use when building new SQL models, extending existing ones, scaffolding YAML configs, or reorganizing project structure. Powered by altimate-dbt.
applyPaths:
- "dbt_project.yml"
- "**/dbt_project.yml"
description: |
REQUIRED before writing or modifying ANY dbt model. Invoke this skill FIRST
whenever a task says "create", "build", "add", "modify", "update", "fix", or
"refactor" a dbt model, staging file, mart, incremental, or snapshot.

Skipping this skill is the leading cause of silent-correctness bugs —
models that compile and `dbt build` cleanly but produce wrong values. It
contains the patterns that prevent the most common such bugs encountered
in real dbt projects:

• Incremental high-water marks (`>=` vs `>` ties → silent row dropout)
• Snapshot strategy selection (timestamp vs check, `unique_key` choice)
• `LEFT JOIN + COUNT(*)` phantom rows from unmatched parents
• Type harmonization in `COALESCE` / `CASE` / `UNION` legs
• Date-spine completeness (every period present, even empty ones)
• Off-by-one window boundaries (`BETWEEN d - (N-1) AND d` for N-wide)
• Uniqueness enforcement when schema implies a key
• Window-function `LIMIT` with deterministic tiebreaker
• Verifying transformation correctness with dbt unit tests, not just `dbt build`
• Enumerating every requested deliverable and checking each exists on disk

Do not start writing SQL until this skill is loaded. Powered by altimate-dbt.
---

# dbt Model Development
@@ -31,6 +55,12 @@ description: Create and modify dbt models — staging, intermediate, marts, incr

Before writing any SQL:
- Read the task requirements carefully
- **Enumerate every concrete deliverable the task asks for** — write down each
model name, every column/test/config change mentioned, and any "create N
models" count. This list becomes the checklist you verify against in
step 4. A task asking for four models is not done if only three exist on
disk. If the task references a `schema.yml`, `_models.yml`, or similar
spec file, every entry there is a deliverable.
- Identify which layer this model belongs to (staging, intermediate, mart)
- Check existing models for naming conventions and patterns
- **Check dependencies:** If `packages.yml` exists, check for `dbt_packages/` or `package-lock.yml`. Only run `dbt deps` if packages are declared but not yet installed.
@@ -98,6 +128,44 @@ altimate-dbt compile --model <name> # catch Jinja errors
altimate-dbt build --model <name> # materialize + run tests
```

**Verify transformation correctness with unit tests:**

For models with non-trivial transformation logic — aggregations, JOINs, CASE/WHEN,
window functions, ratio / rate / NPS calculations, COALESCE / NULL coalescing, date
spines, incremental merge keys — generate and run dbt unit tests before declaring
the model done. Schema checks ("table exists with the right columns") only verify
mechanics; value-level correctness needs unit tests.

Invoke the **dbt-unit-tests** skill, which will:
- Analyze your SQL for the constructs above
- Build typed mock input rows from the manifest
- Compute expected outputs by running the SQL against the mocks
- Write a `unit_tests:` block in the model's `_models.yml`

Then run them:
```bash
altimate-dbt test --model <name> # runs unit tests + schema tests
```

If a unit test fails, the transformation logic is wrong — **fix the SQL, do not
weaken the test**. Skip unit tests only for genuinely trivial models: pure renames,
simple `SELECT *` passthrough, materialization / config-only changes, format-only
edits.

**Verify every requested deliverable exists:**

Walk the checklist you wrote in the Plan step. For each model the task asked
for, confirm: (1) the `.sql` file exists in the project, (2) it appears in
`altimate-dbt info` / the manifest, (3) `altimate-dbt columns --model <name>`
returns the expected columns, (4) the materialization config matches the
spec. A task that asked for N models is not complete with N-1 files on disk,
even if those N-1 build cleanly. Use:

```bash
find models -type f -name "*.sql"    # confirm every requested model file exists (including nested dirs)
altimate-dbt info # confirm every requested model is in the project
```
Comment on lines +165 to +167

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use recursive file discovery instead of ls models/ for deliverable verification.

ls models/ only shows top-level entries, so nested model files can be missed during checklist validation. Prefer a recursive check command.

Suggested doc tweak
-ls models/                                                   # confirm every requested file exists
+find models -type f -name "*.sql"                           # confirm every requested model file exists (including nested dirs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.opencode/skills/dbt-develop/SKILL.md around lines 165 - 167, Replace the
non-recursive "ls models/" check with a recursive file-discovery command so
nested model files aren't missed; update the SKILL.md checklist to use a
recursive listing (e.g., recursive ls or find) targeting model files (reference
the current "ls models/" line) and ensure the new command filters for model file
types so deliverable verification includes files in subdirectories.


**Verify the output:**
```bash
altimate-dbt columns --model <name> # confirm expected columns exist
@@ -127,6 +195,142 @@ Use `altimate-dbt children` and `altimate-dbt parents` to verify the DAG is inta
3. **Match existing patterns.** Read 2-3 existing models in the same directory before writing.
4. **One model, one purpose.** A staging model should not contain business logic. An intermediate model should not be materialized as a table unless it has consumers.
5. **Fix ALL errors, not just yours.** After creating/modifying models, run a full `dbt build`. If ANY model fails — even pre-existing ones you didn't touch — fix them. Your job is to leave the project in a fully working state.
6. **Verify transformation correctness, not just mechanics.** For non-trivial models, generate and run dbt unit tests as part of the validate step (use the `dbt-unit-tests` skill). Passing `dbt build` only proves the SQL is syntactically valid — it doesn't prove the *values* are right.
7. **Enumerate deliverables, then check them off.** The task is not done until every model, column, test, and config change explicitly requested exists on disk and in the manifest. Re-read the prompt at the end and verify each requested item — don't trust your own intermediate "done" feeling.

## Common Pitfalls in Transformation Logic

When the model involves any of the following SQL constructs, watch for these
bugs, most of which compile cleanly but still produce wrong values:

### Incremental models and snapshots

- **High-water mark boundary**: in the `{% if is_incremental() %}` filter, use
`>=` (not `>`) when the upstream timestamp can repeat or land exactly on the
prior max — a strict `>` silently drops every event that ties with the most
recent prior load. A sketch of this filter follows this list.
- **`unique_key` choice**: must be the *natural* unique key of the row. Picking
a column that is not actually unique (e.g. a foreign key like `customer_id`
instead of `order_id`) causes silent merges and lost rows.
- **`on_schema_change`**: set `append_new_columns` (or `sync_all_columns` if
upstream evolves) so a new source column doesn't NULL-out existing data.
- **Snapshots — strategy selection**: use `strategy='timestamp'` only when the
source has a reliable `updated_at` that monotonically increases on every
change. If `updated_at` can be NULL, be reset, or move backwards, switch to
`strategy='check'` with an explicit `check_cols` list. Verify by querying
the source for `MAX(updated_at)` and looking for repeats or NULLs.
- **Backfilling**: `--full-refresh` rebuilds incremental tables from scratch.
Use it whenever you change the incremental SQL, the merge key, or
`on_schema_change`.
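
A minimal sketch of the high-water-mark and `on_schema_change` guidance above, assuming a
hypothetical `stg_events` upstream with an `event_id` natural key and a `loaded_at`
timestamp that can repeat across loads (the names are illustrative, not from this project):

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id',                  -- natural key of the row, not a foreign key
    on_schema_change='append_new_columns'
) }}

select
    event_id,
    customer_id,
    event_type,
    loaded_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- >= keeps events that tie with the previous max; the merge on unique_key
  -- makes re-processing those tied rows idempotent
  where loaded_at >= (select max(loaded_at) from {{ this }})
{% endif %}
```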

### Date and time arithmetic

- **"current age", "days since", "elapsed", "tenure"** — if the column is not
pre-computed in the source, compute it. For year-based age, account for
month/day so the change happens on the birthday, not on Jan 1:
```sql
date_part('year', age(birth_date))  -- Postgres-family shortcut

-- Portable form: subtract 1 if this year's birthday has not happened yet
EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM birth_date)
  - CASE
      WHEN EXTRACT(MONTH FROM CURRENT_DATE) < EXTRACT(MONTH FROM birth_date)
        OR (EXTRACT(MONTH FROM CURRENT_DATE) = EXTRACT(MONTH FROM birth_date)
            AND EXTRACT(DAY FROM CURRENT_DATE) < EXTRACT(DAY FROM birth_date))
      THEN 1 ELSE 0
    END
```
- **Date spines**: when a daily/weekly/monthly model must have a row for
every period (even periods with zero events), build a spine first with
`dbt_utils.date_spine` or a recursive CTE, then LEFT JOIN the events onto
it. Never compute date series by `DISTINCT date_col FROM events` — that
silently drops empty periods. A sketch of the spine pattern follows this list.
- **Date boundaries for windowed sums**: rolling-N-day windows expressed as
`BETWEEN d - (N-1) AND d` (inclusive both ends) give a width of exactly N.
`BETWEEN d - N AND d` gives N+1 — a classic off-by-one.
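
A hedged sketch of the spine-then-join pattern, assuming `dbt_utils` is installed and a
hypothetical `fct_orders` model with `order_id` and `order_date` columns (the spine column
name and the end-date boundary behaviour come from the `dbt_utils.date_spine` macro, so
check your installed version):

```sql
with date_spine as (

    -- one row per day; dbt_utils names the column date_day for datepart='day'
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2024-01-01' as date)",
        end_date="cast(current_date as date)"
    ) }}

),

daily_orders as (

    select
        spine.date_day,
        count(orders.order_id) as order_count   -- count the child key, not *, so empty days show 0
    from date_spine as spine
    left join {{ ref('fct_orders') }} as orders
        on orders.order_date = spine.date_day
    group by 1

)

select * from daily_orders
```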

### Type harmonization in `COALESCE` / `CASE` / `UNION`

`COALESCE(timestamp_col, integer_col)` and `CASE WHEN ... THEN '0' ELSE 0 END`
either fail at compile time or coerce silently to whatever type the engine guesses.
Cast every branch / argument to the same explicit type:
```sql
COALESCE(CAST(timestamp_col AS TIMESTAMP), CAST(integer_col AS TIMESTAMP))
CASE WHEN cond THEN CAST('0' AS NUMERIC) ELSE CAST(0 AS NUMERIC) END
```
Same applies to `UNION` / `UNION ALL` — column types must match across legs.
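
For the `UNION ALL` case, a small sketch (relation and column names are hypothetical) that
casts each leg to the same explicit types:

```sql
select
    cast(order_id   as varchar)   as order_id,
    cast(amount     as numeric)   as amount,
    cast(ordered_at as timestamp) as ordered_at
from {{ ref('stg_web_orders') }}

union all

select
    cast(order_id   as varchar),
    cast(amount     as numeric),
    cast(ordered_at as timestamp)
from {{ ref('stg_pos_orders') }}
```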

### String concatenation with `NULL` operands

`||` and `CONCAT()` propagate `NULL` in most engines — a single `NULL` operand
makes the whole expression `NULL`. When the result feeds an equality join or
surrogate-key generation, that's an invisible row-dropper:
```sql
-- Wrong: NULL region OR NULL segment produces NULL geo_segment
region || '-' || segment AS geo_segment

-- Right: explicit placeholder
COALESCE(region, 'UNKNOWN') || '-' || COALESCE(segment, 'UNKNOWN') AS geo_segment
```
Do not rely on `CONCAT_WS()` to solve this: Snowflake's `CONCAT_WS` still returns `NULL`
when any argument is `NULL`, and BigQuery has no `CONCAT_WS` at all (use `ARRAY_TO_STRING`,
which omits `NULL` elements). When in doubt, `COALESCE` each operand explicitly and verify
the adapter's `NULL` semantics before trusting any concat helper.
Comment on lines +270 to +271

P2: CONCAT_WS support/behavior is documented incorrectly: BigQuery does not support CONCAT_WS, and Snowflake CONCAT_WS does not skip NULLs. This guidance can produce failing or incorrect SQL in both dialects.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .opencode/skills/dbt-develop/SKILL.md, line 270:

<comment>`CONCAT_WS` support/behavior is documented incorrectly: BigQuery does not support `CONCAT_WS`, and Snowflake `CONCAT_WS` does not skip NULLs. This guidance can produce failing or incorrect SQL in both dialects.</comment>

<file context>
@@ -252,6 +255,44 @@ CASE WHEN cond THEN CAST('0' AS NUMERIC) ELSE CAST(0 AS NUMERIC) END
+-- Right: explicit placeholder
+COALESCE(region, 'UNKNOWN') || '-' || COALESCE(segment, 'UNKNOWN') AS geo_segment
+```
+Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
+skips `NULL` operands instead of propagating them, which is usually safer
+than a static placeholder.
</file context>
Suggested change
Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
skips `NULL` operands instead of propagating them, which is usually safer
Use dialect-specific NULL-safe concatenation patterns. In BigQuery, use `ARRAY_TO_STRING([...], '-')` to skip `NULL`s; in Snowflake, `CONCAT_WS` still returns `NULL` when any argument is `NULL`, so wrap operands with `COALESCE(...)`.


Comment on lines +270 to +272

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

`Verify SQL dialect docs:

  1. Snowflake CONCAT_WS NULL behavior
  2. BigQuery GoogleSQL support for CONCAT_WS vs CONCAT/ARRAY_TO_STRING`

💡 Result:

  1. **Snowflake `CONCAT_WS` NULL behavior**: Snowflake's `CONCAT_WS` does not skip NULL values. If any argument to `CONCAT_WS` is NULL, the function returns NULL. [1]
  2. **BigQuery (GoogleSQL) support, `CONCAT_WS` vs `CONCAT`/`ARRAY_TO_STRING`**: BigQuery GoogleSQL does not provide a `CONCAT_WS` function. [2] BigQuery's `CONCAT` returns NULL if any input argument is NULL. [3] To get `CONCAT_WS`-like "join with delimiter while omitting NULLs", the documented alternative is `ARRAY_TO_STRING`, which omits NULL array elements when `null_text` is not supplied (so it does not emit the delimiter for NULLs). [4] Conceptual example: `ARRAY_TO_STRING(['foo', NULL, 'bar'], '#')` produces `'foo#bar'`. [4]

  So: Snowflake `CONCAT_WS` returns NULL if any argument is NULL; BigQuery has no `CONCAT_WS`, and to emulate "skip NULLs with delimiter" you typically use `ARRAY_TO_STRING` (where NULLs are omitted unless you provide `null_text`). [1][3][4]

Citations:


Fix inaccurate CONCAT_WS NULL handling guidance in documentation.

Lines 270–272 claim CONCAT_WS() is supported in Snowflake and BigQuery while skipping NULL operands. However:

  • Snowflake CONCAT_WS propagates NULL (returns NULL if any argument is NULL)
  • BigQuery does not provide CONCAT_WS; use ARRAY_TO_STRING instead for NULL-omitting behavior

This misguidance risks silent NULL propagation bugs in generated SQL. Replace with explicit dialect-safe guidance recommending COALESCE for operands or verification of adapter-specific NULL semantics before relying on any concat function.

Suggested doc fix
-Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
-skips `NULL` operands instead of propagating them, which is usually safer
-than a static placeholder.
+Use dialect-safe null handling explicitly. In many engines, string concat
+propagates `NULL` unless you `COALESCE` each operand first.
+If you choose `CONCAT_WS`, verify your adapter's NULL semantics in docs
+before relying on it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.opencode/skills/dbt-develop/SKILL.md around lines 270 - 272, Update the
documentation guidance about CONCAT_WS: remove the blanket claim that CONCAT_WS
skips NULLs in Snowflake and BigQuery and instead state explicit, dialect-safe
advice — note that Snowflake's CONCAT_WS propagates NULLs, BigQuery lacks
CONCAT_WS (use ARRAY_TO_STRING for NULL-omitting behavior), and recommend using
COALESCE on operands or validating the adapter-specific NULL semantics before
relying on any concat function (mention CONCAT_WS, ARRAY_TO_STRING, COALESCE by
name to help locate the reference).


### dbt model versioning (dbt 1.8+)

When the task asks for a v2 of an existing model (and v1 must keep
working — common during a rolling schema change), use dbt's **versioned
models** feature, not an unregistered sibling model that merely carries a `_v2` suffix:

1. Create the new SQL file (e.g. `dim_accounts_v2.sql`).
2. Add a `versions:` block to the model's entry in `_models.yml`:
```yaml
models:
  - name: dim_accounts
    latest_version: 1
    versions:
      - v: 1
      - v: 2
        defined_in: dim_accounts_v2   # filename without .sql
```
3. Downstream callers reference the version with
`{{ ref('dim_accounts', v=2) }}`. Without the `versions:` block, dbt
treats `dim_accounts_v2` as an unrelated sibling model — versioning
tests will fail and v1↔v2 lineage won't appear in the DAG.
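
A hedged sketch of a downstream check that references both versions under the configuration
above. The unpinned `{{ ref('dim_accounts') }}` resolves to `latest_version` (v1 here);
`account_id` is an assumed key column, not defined in the example:

```sql
-- Rolling-migration reconciliation: accounts present in v1 but missing from v2
select v1.account_id
from {{ ref('dim_accounts', v=1) }} as v1
left join {{ ref('dim_accounts', v=2) }} as v2
    on v1.account_id = v2.account_id
where v2.account_id is null
```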

### Uniqueness when the schema implies it

If the model is named `dim_*`, has a `unique` test in `schema.yml`, or the
task says "one row per X", the model must enforce that grain. Source data
often has duplicates. Use one of:
- `SELECT DISTINCT ...`
- `QUALIFY ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY <tiebreaker>) = 1`
- `GROUP BY <key>` with explicit aggregation of all other columns
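
A minimal dedup sketch using the window-function option in a portable form (no `QUALIFY`);
the grain, ordering, and tiebreaker columns are illustrative:

```sql
with ranked as (

    select
        *,
        row_number() over (
            partition by customer_id                    -- the declared grain: one row per customer
            order by updated_at desc, source_record_id  -- deterministic tiebreaker (hypothetical column)
        ) as row_num
    from {{ ref('stg_customers') }}

)

select * from ranked
where row_num = 1
```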

### Window functions / ranking with `LIMIT` and ties

`ORDER BY metric DESC LIMIT N` (and equivalently `ROW_NUMBER() / RANK() OVER
(PARTITION BY ... ORDER BY metric)` filtered to `<= N`) over a column with
ties returns a **non-deterministic** set — the engine can pick any N of the
tied rows, and the choice often differs across runs, engines, or warehouse
versions. The rest of the pipeline then sees row-count drift or different
keys appearing in downstream joins.

Always add a deterministic tiebreaker to the `ORDER BY` (a primary key, a
surrogate id, or any column guaranteed unique within the partition):
```sql
-- Wrong: ties produce different "top 20" every run
SELECT * FROM standings
ORDER BY points DESC
LIMIT 20

-- Right: tie on points falls back to driver_id
SELECT * FROM standings
ORDER BY points DESC, driver_id ASC
LIMIT 20

-- Same fix inside QUALIFY / window-row-number patterns:
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY season ORDER BY points DESC, driver_id ASC
) <= 20
```
If you can't think of a tiebreaker column, the model probably doesn't yet
have a unique key — fix that first.

## Common Mistakes

@@ -138,6 +342,7 @@ Use `altimate-dbt children` and `altimate-dbt parents` to verify the DAG is inta
| Creating a staging model with JOINs | Staging = 1:1 with source. JOINs belong in intermediate or mart |
| Not checking existing naming conventions | Read existing models in the same directory first |
| Using `SELECT *` in final models | Explicitly list columns for clarity and contract stability |
| `COUNT(*)` over a `LEFT JOIN` — counts unmatched parent rows as if they had one child (e.g. a `dim_listings LEFT JOIN fct_reviews` with no matching reviews still yields one row, so `COUNT(*) = 1` instead of `0`) | Use `COUNT(<child_key>)` or `COUNT(CASE WHEN <child_key> IS NOT NULL THEN 1 END)`. If you intended to exclude unmatched parents, switch to `INNER JOIN`. Same trap applies to `SUM`, `AVG`, etc. when the unmatched side contributes a "ghost" `NULL` row |
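
A minimal illustration of the `LEFT JOIN` counting trap in the last row above, with the
hypothetical listing and review models it mentions:

```sql
select
    listings.listing_id,
    -- count(*) would return 1 for a listing with no reviews (the unmatched "ghost" row);
    -- counting the child key returns 0 as intended
    count(reviews.review_id) as review_count
from {{ ref('dim_listings') }} as listings
left join {{ ref('fct_reviews') }} as reviews
    on reviews.listing_id = listings.listing_id
group by listings.listing_id
```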

## Reference Guides

13 changes: 13 additions & 0 deletions .opencode/skills/dbt-unit-tests/SKILL.md
@@ -32,6 +32,19 @@ description: Generate dbt unit tests automatically for any model. Analyzes SQL l
3. **Use sql format for ephemeral models.** Dict format fails silently for ephemeral upstreams.
4. **Never weaken a test to make it pass.** If the test fails, the model logic may be wrong. Investigate before changing expected values.
5. **Compile before committing.** Always run `altimate-dbt test --model <name>` to verify tests compile and execute.
6. **Mock data MUST exercise the failure modes of every SQL construct in the model.** A unit test that only covers the happy path validates that the model handles easy inputs — it does not validate correctness. Before writing `given:` rows, list every SQL construct in the model and the boundary case it can mishandle, then ensure at least one mock row triggers each. Universal cases to always cover when the construct appears:
- **`LEFT JOIN` / `LEFT OUTER JOIN`** → at least one parent row with **no matching child** (catches `COUNT(*)` phantom rows, `SUM` over `NULL`, fan-out / dropout)
- **`INNER JOIN`** → at least one parent row whose child is filtered out by the JOIN condition (catches missing rows)
- **`COUNT(*)` / `COUNT(<col>)`** → row where the counted column is `NULL` (catches `COUNT(*)` vs `COUNT(col)` divergence)
- **`NULLIF(x, y)`** → row where `x = y` (so the result is `NULL`, exercising downstream `NULL`-handling)
- **`/` division** → row where the denominator is `0` or `NULL`
- **`CASE WHEN`** → at least one row matching each branch, including the implicit `ELSE NULL` if no explicit `ELSE` is set
- **`COALESCE` / `IFNULL`** → row where every argument is `NULL`
- **Window functions (`OVER`)** → an empty partition, a partition of size 1, and a row at the partition boundary
- **Date arithmetic / date spines** → a row at the start of range, end of range, and a gap day with no events
- **Aggregations with `GROUP BY`** → at least one group of size 1 (often masks fan-out bugs) and one group whose key is `NULL`
- **Incremental merge keys** → both an "insert" row and an "update" row matching an existing key
If you can't think of a failure mode for a construct, you don't yet understand it well enough to test it — read the SQL again before guessing inputs.

## Core Workflow: Analyze -> Generate -> Refine -> Validate -> Write
