158 changes: 49 additions & 109 deletions plugins/recce-dev/skills/recce-eval/SKILL.md

Large diffs are not rendered by default.

101 changes: 101 additions & 0 deletions plugins/recce-dev/skills/recce-eval/scenarios/v2/SCENARIOS.md
@@ -237,6 +237,104 @@ where subtotal > 0

---

## data-007: Supply Cost Breakdown — Hidden Fan-out Cascade

**GitHub Issue**: [#4 — Add Supply Cost Analysis and Perishable Inventory Tracking](https://github.com/DataRecce/jaffle-shop-simulator/issues/4)

**Story**: The Purchasing Manager requests a perishable vs. non-perishable supply cost breakdown per order item. A teammate modifies the `order_supplies_summary` CTE in `order_items.sql` to add `is_perishable_supply` to both the SELECT and the GROUP BY.

**Init state (buggy PR)**:
```sql
-- order_items.sql — order_supplies_summary CTE
select
    product_id,
    is_perishable_supply,
    sum(supply_cost) as supply_cost
from supplies
group by 1, 2
```

**The bug**: Adding `is_perishable_supply` to GROUP BY changes the grain of `order_supplies_summary` from 1 row/product to 2 rows/product (perishable + non-perishable). The downstream `LEFT JOIN` in the `joined` CTE then matches each order_item against both summary rows and fans it out into 2 rows. This cascades:
- `order_items`: row count approximately doubles
- `orders.order_cost`: UNCHANGED (sum of split costs = original total)
- `orders.count_order_items`: DOUBLED
- `orders.count_food_items`: DOUBLED (dashboard column!)
- `orders.count_drink_items`: DOUBLED (dashboard column!)
- `orders.order_items_subtotal`: DOUBLED (sum of duplicated product_price)
- `customers`: UNCHANGED (uses order-level columns, not order_items)
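The cascade above can be reproduced on a toy dataset. The following is a minimal sketch, not the project's actual model: it uses Python's built-in `sqlite3` instead of DuckDB, the table contents are hypothetical, costs are stored as integer cents, and it assumes the join key is `product_id`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table supplies (product_id text, is_perishable_supply int, supply_cost int);
    create table order_items (order_item_id int, order_id int, product_id text);
    -- hypothetical data: one product with one perishable and one non-perishable supply
    insert into supplies values ('JAF-001', 1, 25), ('JAF-001', 0, 75);
    insert into order_items values (1, 100, 'JAF-001'), (2, 100, 'JAF-001');
""")

def joined_stats(group_by):
    """Join order_items to a supply summary built with the given GROUP BY columns.

    Returns (row count after the join, total supply cost after the join).
    """
    return con.execute(f"""
        with order_supplies_summary as (
            select {group_by}, sum(supply_cost) as supply_cost
            from supplies
            group by {group_by}
        )
        select count(*), sum(s.supply_cost)
        from order_items oi
        left join order_supplies_summary s on oi.product_id = s.product_id
    """).fetchone()

base = joined_stats("product_id")                         # 1 summary row per product
buggy = joined_stats("product_id, is_perishable_supply")  # 2 summary rows per product

print(base)   # (2, 200)
print(buggy)  # (4, 200)
```

The row count doubles while the summed cost stays the same, which is why a check on `order_cost` alone would not catch the fan-out.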

**What we expect the agent to find**:
- Issue found: **yes** — data drift
- Root cause: grain change in order_supplies_summary fans out the join
- Impacted: `order_items`, `orders`
- Not impacted: `stg_orders`, `customers`, `products`, `supplies`
- Dashboard impact: **yes** (count_food_items, count_drink_items doubled)
- Detection requires: **data comparison**

**Difficulty**: hard — the grain change looks innocent (adding a dimension), but cascades through orders into dashboard columns

---

## data-008: Numeric Precision Refactor — Zero-Change False Positive Trap

**GitHub Issue**: [#2 — Add Tax Summary Report and Cost Accounting Breakdown](https://github.com/DataRecce/jaffle-shop-simulator/issues/2)

**Story**: A Data Engineer wraps all three `cents_to_dollars()` calls in `stg_orders.sql` in `round(..., 2)` for "defensive precision."

**Init state (buggy PR)**:
```sql
-- stg_orders.sql
round({{ cents_to_dollars('subtotal') }}, 2) as subtotal,
round({{ cents_to_dollars('tax_paid') }}, 2) as tax_paid,
round({{ cents_to_dollars('order_total') }}, 2) as order_total,
```

**The bug**: There is NO bug. The `cents_to_dollars` macro already casts to `numeric(16, 2)`. Applying `round(x, 2)` to a value that is already `numeric(16, 2)` is a complete no-op — zero rows change, zero values change across the entire DAG.
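Why the extra rounding cannot change anything can be illustrated outside the warehouse. This is a sketch using Python's `decimal` module as a stand-in for `numeric(16, 2)`. The two helper functions are hypothetical approximations of the macro and of SQL `round(x, 2)`, not the project's code, and the rounding mode is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def cents_to_dollars(cents: int) -> Decimal:
    """Stand-in for the macro: divide by 100 and fix the scale at 2 decimals."""
    return (Decimal(cents) / 100).quantize(CENT, rounding=ROUND_HALF_UP)

def round_2(value: Decimal) -> Decimal:
    """Stand-in for SQL round(x, 2) applied to a fixed-scale numeric."""
    return value.quantize(CENT, rounding=ROUND_HALF_UP)

# Collect every input in a sample range whose value changes after the extra rounding.
changed = [
    cents
    for cents in range(-10_000, 10_001)
    if round_2(cents_to_dollars(cents)) != cents_to_dollars(cents)
]

print(len(changed))  # 0
```

A value that already has exactly two decimal places has nothing left to round, so the list of changed inputs is empty.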

**What we expect the agent to find**:
- Issue found: **no** — the change is a no-op
- Root cause: round() on already-rounded numeric is redundant
- Impacted: none
- Not impacted: `stg_orders`, `orders`, `customers`, `order_items`, `products`
- Dashboard impact: **no**
- Detection requires: **data comparison** (to confirm zero change, not just code reasoning)

**Difficulty**: medium — the agent must resist the trap of reporting impact based on DAG reasoning alone (stg_orders is root → everything downstream "could" be affected)
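Confirming "zero change" by data comparison amounts to checking that the base and PR versions of a model contain the same rows. The sketch below shows one generic way to do that with a symmetric `EXCEPT`. It uses `sqlite3`, hypothetical table names, and hypothetical rows; it is not how Recce itself implements its diffs.

```python
import sqlite3

def diff_row_count(con, base_table, current_table):
    """Count distinct rows that appear in exactly one of the two tables."""
    query = f"""
        select count(*) from (
            select * from (select * from {base_table} except select * from {current_table})
            union all
            select * from (select * from {current_table} except select * from {base_table})
        )
    """
    return con.execute(query).fetchone()[0]

con = sqlite3.connect(":memory:")
con.executescript("""
    create table base_stg_orders (order_id int, order_total text);
    create table pr_stg_orders (order_id int, order_total text);
    create table drifted_stg_orders (order_id int, order_total text);
    insert into base_stg_orders values (1, '12.34'), (2, '5.00');
    insert into pr_stg_orders values (1, '12.34'), (2, '5.00');
    insert into drifted_stg_orders values (1, '12.34'), (2, '5.01');
""")

no_op = diff_row_count(con, "base_stg_orders", "pr_stg_orders")
drift = diff_row_count(con, "base_stg_orders", "drifted_stg_orders")
print(no_op, drift)  # 0 2
```

`EXCEPT` compares distinct rows, so it does not notice exact duplicates. A row count comparison should accompany it.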

---

## data-009: Date Truncation Change — Month Grain Collapses Daily Timeline

**GitHub Issue**: [#9 — Optimize Date Granularity for Monthly Reporting](https://github.com/DataRecce/jaffle-shop-simulator/issues/9)

**Story**: An Analytics Engineer changes the `date_trunc` granularity in `stg_orders.sql` from `'day'` to `'month'` to "reduce cardinality and improve query performance."

**Init state (buggy PR)**:
```sql
-- stg_orders.sql
{{ dbt.date_trunc('month','ordered_at') }} as ordered_at
```

**The bug**: `ordered_at` loses daily granularity — all orders in the same month collapse to the 1st of the month. This propagates through the entire DAG:
- `orders.ordered_at` — month-level (dashboard column!)
- `orders.customer_order_number` — ROW_NUMBER ordered by `ordered_at` becomes non-deterministic for orders within the same month, because they now tie on the sort key
- `order_items.ordered_at` — month-level
- `customers.first_ordered_at` / `last_ordered_at` — month-level only

Financial columns (subtotal, tax_paid, order_total) are completely unchanged. Row counts are identical — impact is purely value-level on date columns.
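The shape of this impact (same row count, fewer distinct dates, tied sort keys) can be seen on three hypothetical orders. This is a pure-Python sketch. `trunc_month` mimics month truncation, and the assumption that `customer_order_number` is numbered per customer by `ordered_at` is illustrative rather than taken from the model source.

```python
from datetime import date

# Hypothetical orders: (order_id, customer_id, ordered_at at day grain)
orders = [
    (1, "c1", date(2024, 3, 2)),
    (2, "c1", date(2024, 3, 17)),
    (3, "c1", date(2024, 4, 5)),
]

def trunc_month(d: date) -> date:
    """Truncate a date to the first day of its month."""
    return d.replace(day=1)

# Sort keys a per-customer ROW_NUMBER would use, before and after the change.
day_keys = [(customer, ordered_at) for _, customer, ordered_at in orders]
month_keys = [(customer, trunc_month(ordered_at)) for _, customer, ordered_at in orders]

same_row_count = len(day_keys) == len(month_keys)
distinct_before = len(set(day_keys))
distinct_after = len(set(month_keys))
tied_rows = len(month_keys) - distinct_after

print(same_row_count, distinct_before, distinct_after, tied_rows)  # True 3 2 1
```

Orders 1 and 2 now share the key `("c1", 2024-03-01)`, so their relative numbering depends on whatever order the engine happens to return them in.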

**What we expect the agent to find**:
- Issue found: **yes** — data drift
- Root cause: date_trunc changed from day to month, collapsing daily granularity
- Impacted: `stg_orders`, `orders`, `order_items`, `customers`
- Not impacted: `products`, `supplies`, `locations`
- Dashboard impact: **yes** (ordered_at is a dashboard column)
- Detection requires: **data comparison**

**Difficulty**: medium — the agent must correctly scope impact to date columns only and avoid false positives on financial metrics

---

## Summary Matrix

| ID | Bug Type | Modified/New | Difficulty | Detection | Dashboard? | Affected Rows |
@@ -247,4 +345,7 @@ where subtotal > 0
| data-004 | Count ratio vs cost ratio | New `supply_analysis` | medium | data comparison | no | all rows |
| data-005 | current_date on historical data | New `customer_segments` | easy | data comparison | no | all rows |
| data-006 | Tax instead of COGS in formula | New `financial_orders` | easy | data comparison | no | all rows |
| data-007 | Grain fan-out cascades to dashboard | Modified `order_items` | hard | data comparison | yes | all rows (doubled) |
| data-008 | No-op precision change (false positive trap) | Modified `stg_orders` | medium | data comparison | no | 0 |
| data-009 | Date grain collapse (day→month) | Modified `stg_orders` | medium | data comparison | yes | 658,657 |
| code-001 | Wrong filter column (spec deviation) | Modified `stg_orders` | hard | code review | no | 4,155 |
@@ -0,0 +1,72 @@
id: data-007-supply-grain-fanout
name: "Supply Cost Breakdown — Hidden Fan-out Cascade"
description: "order_items supply summary adds is_perishable_supply to GROUP BY — grain change fans out join, doubling count columns through orders mart into dashboard"
github_issue: https://github.com/DataRecce/jaffle-shop-simulator/issues/4
layer: review
difficulty: hard
stakeholder: purchasing
case_type: problem_exists

story: |
  The Purchasing Manager (P2) requested a breakdown of perishable vs non-perishable supply
  costs per order item, to better understand spoilage risk in the supply chain.

  A teammate modified the `order_supplies_summary` CTE in `order_items.sql` to include
  `is_perishable_supply` in the GROUP BY and SELECT. This splits each product's supply cost
  into two rows: one for perishable supplies, one for non-perishable supplies.

  The code change looks reasonable — adding a dimension to an aggregation. But it changes
  the grain of `order_supplies_summary` from 1 row per product to 2 rows per product.
  The downstream LEFT JOIN in the `joined` CTE now produces 2 rows per order_item (one for
  each perishable category). This fan-out cascades:

  - `order_items`: row count approximately doubles
  - `orders.order_cost`: UNCHANGED (sum of split costs = original total)
  - `orders.count_order_items`: DOUBLED (counts duplicated rows)
  - `orders.count_food_items`: DOUBLED (dashboard column!)
  - `orders.count_drink_items`: DOUBLED (dashboard column!)
  - `orders.order_items_subtotal`: DOUBLED (sum of duplicated product_price)
  - `customers`: UNCHANGED (aggregates use order-level columns from stg_orders, not order_items)

  The bug is a classic grain mismatch hidden behind an innocent-looking GROUP BY change.

environment:
  repo: DataRecce/jaffle-shop-simulator
  ref: eval-base
  adapter: duckdb

setup:
  strategy: git_patch
  patch_reverse_file: scenarios/v2/patches/data-007-supply-grain-fanout.patch
  skip_context: false

prompt:
  template: prompts/review.md
  vars:
    stakeholder_name: "Purchasing Manager (P2)"
    stakeholder_request: "Add perishable vs non-perishable supply cost breakdown per order item for spoilage risk analysis"
    pr_description: "Add is_perishable_supply dimension to order_items supply cost aggregation — splits supply_cost into perishable and non-perishable components"

headless:
  max_budget_usd: 5.00
  output_format: json

ground_truth:
  issue_found: true
  issue_type: data_drift
  root_cause_keywords: ["grain", "fan-out", "group by", "is_perishable_supply", "duplicate", "count", "order_supplies_summary", "double"]
  impacted_models: ["order_items", "orders"]
  not_impacted_models: ["stg_orders", "customers", "products", "supplies"]
  dashboard_impact: true
  detection_requires: data_comparison

judge_criteria:
  - "Agent identifies the grain change in order_supplies_summary (1 row/product → 2 rows/product)"
  - "Agent recognizes the fan-out cascade: order_items rows doubled → orders count columns doubled"
  - "Agent notes that order_cost (sum of supply_cost) is UNCHANGED despite the fan-out — sum of parts equals the original total"
  - "Agent identifies that count_food_items and count_drink_items are DOUBLED — these are Executive Dashboard columns"
  - "Agent correctly identifies that customers model is NOT impacted"
  - "Agent correctly identifies dashboard_impact as true (count_food_items, count_drink_items)"

teardown:
  restore_files: ["models/marts/order_items.sql"]
@@ -0,0 +1,69 @@
id: data-008-precision-noop
name: "Numeric Precision Refactor — Zero-Change False Positive Trap"
description: "stg_orders wraps cents_to_dollars with round(x, 2) — macro already outputs numeric(16,2) so data is identical, but code diff touches root staging model"
github_issue: https://github.com/DataRecce/jaffle-shop-simulator/issues/2
layer: review
difficulty: medium
stakeholder: data-engineering
case_type: no_problem

story: |
  A Data Engineer noticed that the `cents_to_dollars` macro returns `::numeric(16, 2)` but
  wanted to make the precision "explicit and defensive" by wrapping all three money columns
  in `stg_orders.sql` with `round(..., 2)`.

  The PR description says: "Add explicit rounding to money columns for precision safety —
  ensures no floating point drift in downstream aggregations."

  The change modifies `stg_orders.sql`, which is the ROOT staging model feeding into
  `orders`, `customers`, and every downstream mart. A code-only reviewer seeing a change
  to the root financial staging model would reasonably flag this as high-risk and report
  potential impact on all downstream models.

  However, `cents_to_dollars` already casts to `numeric(16, 2)`. Applying `round(x, 2)` to
  a value that is already `numeric(16, 2)` is a complete no-op — zero rows change, zero
  values change, zero downstream impact. The correct assessment is: no issue found.

  This scenario tests whether the agent can use data comparison to CONFIRM safety rather
  than relying on DAG reasoning alone (which would produce false positives).

environment:
  repo: DataRecce/jaffle-shop-simulator
  ref: eval-base
  adapter: duckdb

setup:
  strategy: git_patch
  patch_reverse_file: scenarios/v2/patches/data-008-precision-noop.patch
  skip_context: false

prompt:
  template: prompts/review.md
  vars:
    stakeholder_name: "Data Engineer (P3)"
    stakeholder_request: "Add explicit rounding to money columns in stg_orders for precision safety"
    pr_description: "Wrap cents_to_dollars output with round(x, 2) in stg_orders — defensive precision for downstream financial aggregations"

headless:
  max_budget_usd: 5.00
  output_format: json

ground_truth:
  issue_found: false
  issue_type: no_issue
  root_cause_keywords: ["no-op", "round", "numeric", "precision", "already", "identical", "no change", "zero"]
  impacted_models: []
  not_impacted_models: ["stg_orders", "orders", "customers", "order_items", "products"]
  dashboard_impact: false
  detection_requires: data_comparison

judge_criteria:
  - "Agent verifies through DATA comparison that all downstream models have zero value changes"
  - "Agent recognizes that round(numeric(16,2), 2) is a no-op — the macro already handles precision"
  - "Agent does NOT report false positives on orders, customers, or other downstream models"
  - "Agent correctly concludes issue_found: false — no data impact despite code change to root model"
  - "Agent correctly identifies dashboard_impact as false"
  - "Agent avoids the trap of DAG-based reasoning alone (stg_orders is root → everything must be impacted)"

teardown:
  restore_files: ["models/staging/stg_orders.sql"]
@@ -0,0 +1,78 @@
id: data-009-date-grain-month
name: "Date Truncation Change — Month Grain Collapses Daily Timeline"
description: "stg_orders changes date_trunc from day to month — ordered_at loses daily granularity across entire DAG, but financial columns are unchanged"
github_issue: https://github.com/DataRecce/jaffle-shop-simulator/issues/9
layer: review
difficulty: medium
stakeholder: analytics
case_type: problem_exists

story: |
  An Analytics Engineer proposed changing the date truncation in `stg_orders.sql` from
  `day` to `month` to "reduce cardinality and improve query performance for monthly
  reporting dashboards."

  The PR modifies one line in `stg_orders.sql`:
  - Before: `date_trunc('day', ordered_at)`
  - After: `date_trunc('month', ordered_at)`

  The change compiles fine and all dbt tests pass. The PR description argues this is a
  harmless optimization since "most reports aggregate to monthly anyway."

  However, `stg_orders` is the ROOT staging model for the entire orders pipeline. The
  `ordered_at` column propagates through:
  - `orders.ordered_at` — now month-level (dashboard column!)
  - `orders.customer_order_number` — ROW_NUMBER ordered by month becomes non-deterministic
    for orders within the same month
  - `order_items.ordered_at` — joined from stg_orders, now month-level
  - `customers.first_ordered_at` — now month-level only (loses day precision)
  - `customers.last_ordered_at` — now month-level only (loses day precision)

  Critically, financial columns (subtotal, tax_paid, order_total, order_cost) are
  COMPLETELY UNCHANGED. The agent must correctly scope the impact to date/time columns
  only and avoid false positives on financial metrics.

  Row counts are identical across all models — no rows added or removed. The impact is
  purely in value changes to the ordered_at column and its derivatives.

environment:
  repo: DataRecce/jaffle-shop-simulator
  ref: eval-base
  adapter: duckdb

setup:
  strategy: git_patch
  patch_reverse_file: scenarios/v2/patches/data-009-date-grain-month.patch
  skip_context: false

prompt:
  template: prompts/review.md
  vars:
    stakeholder_name: "Analytics Engineer (P3)"
    stakeholder_request: "Optimize date granularity in stg_orders from daily to monthly for reporting performance"
    pr_description: "Change date_trunc from day to month in stg_orders — reduces ordered_at cardinality for faster monthly aggregations"

headless:
  max_budget_usd: 5.00
  output_format: json

ground_truth:
  issue_found: true
  issue_type: data_drift
  root_cause_keywords: ["date_trunc", "month", "day", "ordered_at", "granularity", "precision", "cardinality"]
  impacted_models: ["stg_orders", "orders", "order_items", "customers"]
  not_impacted_models: ["products", "supplies", "locations"]
  dashboard_impact: true
  detection_requires: data_comparison

judge_criteria:
  - "Agent identifies that ordered_at loses daily granularity — collapses to month-level across the DAG"
  - "Agent correctly identifies dashboard_impact as true (ordered_at is a dashboard column)"
  - "Agent correctly identifies that financial columns (subtotal, tax_paid, order_total) are UNCHANGED"
  - "Agent correctly scopes impacted_models to those that use ordered_at: stg_orders, orders, order_items, customers"
  - "Agent does NOT falsely report products, supplies, or locations as impacted"
  - "Agent notes that customer_order_number becomes non-deterministic for same-month orders"
  - "Agent recognizes row counts are unchanged — the impact is value-level, not row-level"

teardown:
  restore_files: ["models/staging/stg_orders.sql"]
@@ -0,0 +1,26 @@
diff --git a/models/marts/order_items.sql b/models/marts/order_items.sql
--- a/models/marts/order_items.sql
+++ b/models/marts/order_items.sql
@@ -29,13 +29,12 @@

select
product_id,
- is_perishable_supply,

sum(supply_cost) as supply_cost

from supplies

- group by 1, 2
+ group by 1

),

@@ -51,7 +50,6 @@
products.is_food_item,
products.is_drink_item,

- order_supplies_summary.is_perishable_supply,
order_supplies_summary.supply_cost

from order_items
@@ -0,0 +1,16 @@
diff --git a/models/staging/stg_orders.sql b/models/staging/stg_orders.sql
--- a/models/staging/stg_orders.sql
+++ b/models/staging/stg_orders.sql
@@ -19,9 +19,9 @@
subtotal as subtotal_cents,
tax_paid as tax_paid_cents,
order_total as order_total_cents,
- round({{ cents_to_dollars('subtotal') }}, 2) as subtotal,
- round({{ cents_to_dollars('tax_paid') }}, 2) as tax_paid,
- round({{ cents_to_dollars('order_total') }}, 2) as order_total,
+ {{ cents_to_dollars('subtotal') }} as subtotal,
+ {{ cents_to_dollars('tax_paid') }} as tax_paid,
+ {{ cents_to_dollars('order_total') }} as order_total,

---------- timestamps
{{ dbt.date_trunc('day','ordered_at') }} as ordered_at
@@ -0,0 +1,12 @@
diff --git a/models/staging/stg_orders.sql b/models/staging/stg_orders.sql
--- a/models/staging/stg_orders.sql
+++ b/models/staging/stg_orders.sql
@@ -24,7 +24,7 @@
{{ cents_to_dollars('order_total') }} as order_total,

---------- timestamps
- {{ dbt.date_trunc('month','ordered_at') }} as ordered_at
+ {{ dbt.date_trunc('day','ordered_at') }} as ordered_at

from source
