diff --git a/plugins/recce-dev/skills/recce-eval/SKILL.md b/plugins/recce-dev/skills/recce-eval/SKILL.md index fc21c18..2d1fcad 100644 --- a/plugins/recce-dev/skills/recce-eval/SKILL.md +++ b/plugins/recce-dev/skills/recce-eval/SKILL.md @@ -127,7 +127,7 @@ If the user selects nothing (cancels), **STOP**. ## Run Flow -This is the core orchestration — 12 steps that set up scenarios, run headless Claude Code, score results, and produce a report. +This is the core orchestration — 11 steps that set up scenarios, run headless Claude Code, score results, and produce a report. ### Step 1: Read Scenario(s) @@ -175,7 +175,7 @@ yq -o=json '{ }' "/.yaml" ``` -When `prompt_template` is non-null (v2), read the template file and substitute vars in Step 5. When `prompt_inline` is non-null (v1), use it directly as the prompt text. +When `prompt_template` is non-null (v2), `run-batch.sh` uses `render-prompt.py` with template+vars. When `prompt_inline` is non-null (v1), it substitutes runtime variables directly. ### Step 1b: Clone & Bootstrap v2 Project (v2 only) @@ -195,9 +195,9 @@ eval "$(bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/setup-v2-project.sh echo "PROJECT_DIR=$PROJECT_DIR" ``` -Record `PROJECT_DIR` — pass it as `--project-dir "$PROJECT_DIR"` to all `run-case.sh` invocations in Step 7. +Record `PROJECT_DIR` — pass it as `--project-dir "$PROJECT_DIR"` to `run-batch.sh` in Step 6. -**Cleanup**: At the very end of the Run Flow (after Step 12), remove the temp project: +**Cleanup**: At the very end of the Run Flow (after Step 11), remove the temp project: ```bash if [ -n "$WORK_DIR" ] && [[ "$WORK_DIR" == "${TMPDIR:-/tmp}"* ]]; then @@ -263,23 +263,7 @@ echo "BATCH_DIR=$BATCH_DIR" Record `EVAL_ID` and `BATCH_DIR` for later steps. `BATCH_DIR` is always absolute and anchored to the invoking CWD, so eval output survives v2 temp project cleanup. -### Step 5: Prepare Prompt - -Build the prompt text for each scenario, then write to a temp file. 
- -**v2 (template+vars):** Read the template file from `prompt_template` (relative to `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/`), then substitute `{variables}` with values from `prompt_vars` and runtime values (`{target}`, `{adapter_description}`). - -**v1 (inline prompt):** Use the `prompt_inline` string directly, substituting only runtime values (`{target}`, `{adapter_description}`). - -```bash -PROMPT_FILE="/tmp/recce-eval-prompt-${EVAL_ID}-${SCENARIO_ID}.txt" -cat > "$PROMPT_FILE" << 'PROMPT_EOF' - -PROMPT_EOF -echo "PROMPT_FILE=$PROMPT_FILE" -``` - -### Step 6: Generate Eval MCP Config +### Step 5: Generate Eval MCP Config Create a temporary MCP config JSON using **stdio** transport for Recce MCP. This avoids DuckDB lock conflicts — claude spawns the MCP server as a child process after run-case.sh setup completes, so `dbt run` in setup never competes for the database lock. @@ -307,98 +291,53 @@ echo "MCP_CONFIG=/tmp/recce-eval-mcp-config.json" **Why `--strict-mcp-config`**: The `--mcp-config` flag is additive and its merge behavior with plugin `.mcp.json` for same-name keys is undocumented. Using `--strict-mcp-config` guarantees the eval config is the sole MCP source. -### Step 7: Interleaved Run Loop +### Step 6: Run Eval Batch -Set `NO_BARE` based on whether the user passed `--no-bare`: -- If `--no-bare` was passed: `NO_BARE=true` (passes `--no-bare --no-clean-profile` to `run-case.sh`, uses OAuth auth) -- Otherwise: `NO_BARE=""` (default `--bare` mode, requires `ANTHROPIC_API_KEY`) +Run all scenarios with both variants using `run-batch.sh`. This script encapsulates prompt rendering, the interleaved run loop, and deterministic scoring into a single background-capable command. -Run each scenario with both variants in interleaved order. For N runs, the execution order is: baseline run1 → with-plugin run1 → baseline run2 → with-plugin run2 → ... This reduces systematic bias from cache warming or temporal effects. 
+Build a comma-separated list of absolute scenario file paths from the scenarios parsed in Step 1. Determine the `ADAPTER_DESC` string from the adapter detected in Step 2: -For each run number (1 to N), for each variant (`baseline` first, then `with-plugin`): +| Adapter | `ADAPTER_DESC` | +|---------|---------------| +| duckdb | `DuckDB (local file database, target: $TARGET)` | +| snowflake | `Snowflake (cloud data warehouse, target: $TARGET)` | ```bash -# Create scenario output dir -mkdir -p "$BATCH_DIR/$SCENARIO_ID" - -# ---- Baseline variant ---- -# --bare is default: no memory, no CLAUDE.md, pure prompt-driven evaluation -# When user passes --no-bare: add --no-bare --no-clean-profile (uses OAuth, no API key needed) -bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/run-case.sh \ - --id "$SCENARIO_ID" \ - --case-type "$CASE_TYPE" \ - --variant baseline \ - --prompt-file "$PROMPT_FILE" \ - --setup-strategy "$SETUP_STRATEGY" \ - --patch-file "${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/$PATCH_FILE" \ - --restore-files "$RESTORE_FILES" \ +bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/run-batch.sh \ + --scenarios "$SCENARIO_LIST" \ + --batch-dir "$BATCH_DIR" \ + --eval-id "$EVAL_ID" \ + --skill-dir "${CLAUDE_PLUGIN_ROOT}/skills/recce-eval" \ + --recce-plugin "$RECCE_PLUGIN_ROOT" \ --target "$TARGET" \ - --max-budget-usd "$MAX_BUDGET" \ - --output-dir "$BATCH_DIR/$SCENARIO_ID" \ - --run-number "$RUN_NUM" \ - ${NO_BARE:+--no-bare --no-clean-profile} \ - ${PROJECT_DIR:+--project-dir "$PROJECT_DIR"} -``` - -Parse the KEY=VALUE output from `run-case.sh`. Record `OUTPUT_FILE`, `JSON_EXTRACTED`, `TOTAL_COST_USD`, `DURATION_MS`. 
- -Immediately score the baseline run: - -```bash -bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/score-deterministic.sh \ - --run-file "$BATCH_DIR/$SCENARIO_ID/baseline_run${RUN_NUM}.json" \ - --case-type "$CASE_TYPE" \ - --ground-truth '$GROUND_TRUTH_JSON' -``` - -Then run the with-plugin variant: - -```bash -# ---- With-plugin variant ---- -# --bare is default; --plugin-dir injects the plugin even in bare mode -# When user passes --no-bare: add --no-bare --no-clean-profile (uses OAuth, no API key needed) -bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/run-case.sh \ - --id "$SCENARIO_ID" \ - --case-type "$CASE_TYPE" \ - --variant with-plugin \ - --prompt-file "$PROMPT_FILE" \ - --setup-strategy "$SETUP_STRATEGY" \ - --patch-file "${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/$PATCH_FILE" \ - --restore-files "$RESTORE_FILES" \ - --target "$TARGET" \ - --max-budget-usd "$MAX_BUDGET" \ - --output-dir "$BATCH_DIR/$SCENARIO_ID" \ - --plugin-dir "$RECCE_PLUGIN_ROOT" \ + --adapter-desc "$ADAPTER_DESC" \ --mcp-config /tmp/recce-eval-mcp-config.json \ - --run-number "$RUN_NUM" \ - ${NO_BARE:+--no-bare --no-clean-profile} \ + -n $N \ + ${MODEL:+--model "$MODEL"} \ + ${NO_BARE:+--no-bare} \ ${PROJECT_DIR:+--project-dir "$PROJECT_DIR"} ``` -Score the with-plugin run: +Where `$SCENARIO_LIST` is a comma-separated list of absolute paths to scenario YAML files (e.g., `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scenarios/v2/data-001-double-tax-deduction.yaml,${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scenarios/v2/data-002-cogs-food-only.yaml,...`). -```bash -bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/score-deterministic.sh \ - --run-file "$BATCH_DIR/$SCENARIO_ID/with-plugin_run${RUN_NUM}.json" \ - --case-type "$CASE_TYPE" \ - --ground-truth '$GROUND_TRUTH_JSON' -``` +**What `run-batch.sh` does internally:** +1. **Renders prompts** for each scenario (v2: `render-prompt.py` with template+vars; v1: inline with variable substitution) +2. 
**Runs interleaved loop**: for each run number (1 to N), for each scenario, baseline → score → with-plugin → score. This interleaving reduces systematic bias from cache warming or temporal effects. +3. **Scores each run** immediately with `score-deterministic.sh` +4. **Writes `batch-summary.json`** with run counts, timing, and scenario list -**Important**: The `--ground-truth` value must be a valid JSON string. Extract the `ground_truth` object from the scenario YAML and pass it as a single-quoted JSON string. Example: - -```bash ---ground-truth '{"issue_found":true,"root_cause_keywords":["null","left join","coalesce"],"impacted_models":["orders","orders_daily_summary"],"not_impacted_models":["customers","customer_segments","customer_order_pattern"],"affected_row_count":1584,"all_tests_pass":true}' -``` +**Output files** (all in `$BATCH_DIR`): +- `/baseline_run.json` — per-run JSONs with deterministic scores merged in +- `/with-plugin_run.json` — per-run JSONs with deterministic scores merged in +- `batch-summary.json` — machine-readable batch metadata (succeeded/failed counts, duration, scenario list) -**Handling setup.strategy**: When calling `run-case.sh`: -- If `setup.strategy` is `git_patch`, pass `--patch-file` pointing to `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/` and `--restore-files` as a comma-separated list from `teardown.restore_files`. -- If `setup.strategy` is `none`, pass `--setup-strategy none`. Omit `--patch-file` and `--restore-files`. +**Running in background**: This command can be run via the Bash tool's `run_in_background` parameter for long batches. When complete, proceed to Step 7. -**Error handling**: If `run-case.sh` fails (non-zero exit), log the error and continue to the next run. The teardown trap inside `run-case.sh` handles file restoration automatically. Do not add separate teardown calls here. +**Error handling**: If a single `run-case.sh` invocation fails, the batch continues to the next run. 
Failed runs are counted in `batch-summary.json`. The teardown trap inside `run-case.sh` handles file restoration automatically. -Report progress to the user after each run completes: "Run {N} {variant} complete: cost=${cost}, duration=${duration}s, json_extracted={yes/no}". +**Isolation mode**: `--bare` is the default (both variants get identical isolation). `--no-bare` uses OAuth auth with no API key needed. See Isolation Modes section for details. -### Step 8: Dispatch LLM Judge +### Step 7: Dispatch LLM Judge Use the Agent tool to dispatch `recce-dev:eval-judge` with a prompt that includes all the information the judge needs. Group runs by scenario so the judge can compare variants: @@ -426,7 +365,7 @@ If running multiple scenarios, dispatch the judge once per scenario (not once pe **Error handling**: If the judge agent fails or returns invalid JSON, continue without judge scores. The report will note "LLM judge: unavailable" for affected runs. -### Step 9: Merge Judge Scores +### Step 8: Merge Judge Scores Parse the judge's JSON output. For each run entry in the judge's `runs` array, read the corresponding per-run JSON file and merge `scores.llm_judge` into it: @@ -453,7 +392,7 @@ The judge returns scores per run in the format: Write `comparison_notes` to each run's `scores.llm_judge.comparison_notes` as well. -### Step 10: Write meta.json +### Step 9: Write meta.json Write batch metadata to the batch directory: @@ -476,7 +415,7 @@ EOF Where `$SCENARIOS_JSON_ARRAY` is a JSON array of scenario IDs (e.g., `["data-001-double-tax-deduction", "data-002-cogs-food-only"]`), and `$CLAUDE_MODEL` is from `--model` flag or the current session's model. -### Step 11: Generate Report +### Step 10: Generate Report Read all per-run JSONs in the batch directory (now containing both deterministic and judge scores). Follow the structure defined in `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/references/report-template.md`. @@ -498,7 +437,7 @@ The report includes: 4. 
**Detailed Scores** with per-run deterministic checks and judge scores
5. **Cross-Eval Comparison** with historical deltas (if available)

-### Step 12: Update History and Print Summary
+### Step 11: Update History and Print Summary

Append a summary entry to `.claude/recce-eval/history.json`:

@@ -603,27 +542,27 @@ bash run-case.sh --id ch3-phantom-filter --variant with-plugin \

## Common Mistakes

-- **Shell variables do not persist**: Each Bash tool invocation starts a fresh shell. Re-derive `EVAL_ID`, `BATCH_DIR`, `TARGET`, `ADAPTER`, `RECCE_PLUGIN_ROOT`, and other state in every Bash block that needs them. Do not assume a previous Bash call's variables are available.
+- **Shell variables do not persist**: Each Bash tool invocation starts a fresh shell. Re-derive `EVAL_ID`, `BATCH_DIR`, `TARGET`, `ADAPTER`, `RECCE_PLUGIN_ROOT`, and other state in every Bash block that needs them. Do not assume a previous Bash call's variables are available. Note: `run-batch.sh` eliminates this problem for the run loop (Step 6), but Steps 1-5 and 7-11 still run as separate Bash calls.

- **Forgetting `eval`**: Running `bash resolve-recce-root.sh` without `eval "$(...)"` does not set `RECCE_PLUGIN_ROOT` in the current shell.

- **Platform-specific `md5`**: macOS uses `md5`, Linux uses `md5sum`. The eval scripts handle both — do not simplify to one.

-- **MCP config uses `--strict-mcp-config`**: The eval config must be the sole MCP source. `run-case.sh` passes `--strict-mcp-config --mcp-config` so the eval port is guaranteed. The eval config in Step 6 must include both `recce` (eval port) and `recce-docs` (from `$RECCE_PLUGIN_ROOT`).
+- **MCP config uses `--strict-mcp-config`**: The eval config must be the sole MCP source. `run-case.sh` passes `--strict-mcp-config --mcp-config` so the eval port is guaranteed. The eval config in Step 5 must include both `recce` (eval port) and `recce-docs` (from `$RECCE_PLUGIN_ROOT`). 
- **`--mcp-config` is variadic**: `--mcp-config ` consumes subsequent positional arguments. The `--` separator before the prompt in `run-case.sh` prevents the prompt from being parsed as a config argument. Do not remove it. -- **Interleaved order matters**: Run baseline then with-plugin for the same run number before moving to the next run number. Do not group all baselines then all with-plugins — this introduces systematic bias. +- **Interleaved order matters**: `run-batch.sh` handles this automatically — baseline then with-plugin for each run number. If running manually without `run-batch.sh`, do not group all baselines then all with-plugins — this introduces systematic bias. -- **Teardown is trap-based in run-case.sh**: The script restores files even if `claude -p` fails. Do not add separate teardown calls in the SKILL.md orchestration. +- **Teardown is trap-based in run-case.sh**: The script restores files even if `claude -p` fails. Do not add separate teardown calls in the orchestration. -- **Ground truth as JSON string**: When passing `--ground-truth` to `score-deterministic.sh`, the value must be a valid JSON string. Use single quotes around the entire JSON value in bash to prevent shell expansion. +- **Ground truth as JSON string**: `run-batch.sh` handles this automatically via `yq -o=json | jq -c`. If running `score-deterministic.sh` manually, the `--ground-truth` value must be a valid JSON string. - **Adapter detection uses `yq`**: Do not use grep to parse profiles.yml. The target's adapter type depends on the nested YAML structure which requires proper YAML parsing. - **stdio MCP needs no lifecycle management**: With stdio transport, claude spawns/kills the MCP server automatically. No `start-eval-mcp.sh` / `stop-eval-mcp.sh` calls needed. The `start-eval-mcp.sh` and `stop-eval-mcp.sh` scripts are retained for SSE mode fallback only. 
-- **Prompt file per scenario**: When running `--all`, create a separate prompt file for each scenario (use `${EVAL_ID}-${SCENARIO_ID}` in the filename) since each scenario has a different prompt. +- **Prompt file per scenario**: `run-batch.sh` handles this automatically (naming: `/tmp/recce-eval-prompt-${EVAL_ID}-${SCENARIO_ID}.txt`). If running manually, create a separate prompt file for each scenario. - **v2 project cleanup**: When `--version v2`, clean up `WORK_DIR` at the end of the Run Flow. Always guard with a `$TMPDIR` prefix check before `rm -rf` to avoid accidental deletion outside temp. @@ -637,9 +576,10 @@ bash run-case.sh --id ch3-phantom-filter --variant with-plugin \ ### Scripts +- **`scripts/run-batch.sh`** — Batch eval runner: renders prompts, runs interleaved loop (baseline→score→with-plugin→score per run per scenario), writes `batch-summary.json`. Background-capable. Encapsulates Steps 5-7 from the original orchestration. - **`scripts/list-scenarios.sh`** — List scenarios for a version. Single `yq eval-all` call. Outputs pipe-delimited rows. -- **`scripts/run-case.sh`** — Atomic runner: setup state, invoke `claude -p`, capture output, teardown, write per-run JSON. Outputs KEY=VALUE lines. -- **`scripts/score-deterministic.sh`** — jq-based scoring against ground truth. Reads and updates per-run JSON in-place. Outputs KEY=VALUE lines. +- **`scripts/run-case.sh`** — Atomic runner: setup state, invoke `claude -p`, capture output, teardown, write per-run JSON. Outputs KEY=VALUE lines. Called by `run-batch.sh`. +- **`scripts/score-deterministic.sh`** — jq-based scoring against ground truth. Reads and updates per-run JSON in-place. Outputs KEY=VALUE lines. Called by `run-batch.sh`. - **`scripts/setup-v2-project.sh`** — Clone a dbt project repo to a temp dir and bootstrap (venv, dbt deps, seed). Used by v2 scenarios only. Outputs `PROJECT_DIR=` and `WORK_DIR=`. 
- **`scripts/start-eval-mcp.sh`** — Start Recce MCP server on eval-specific port (default 8085). Retained for SSE mode fallback only. - **`scripts/stop-eval-mcp.sh`** — Stop eval MCP server. Retained for SSE mode fallback only. diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/SCENARIOS.md b/plugins/recce-dev/skills/recce-eval/scenarios/v2/SCENARIOS.md index c926381..e68a39a 100644 --- a/plugins/recce-dev/skills/recce-eval/scenarios/v2/SCENARIOS.md +++ b/plugins/recce-dev/skills/recce-eval/scenarios/v2/SCENARIOS.md @@ -237,6 +237,104 @@ where subtotal > 0 --- +## data-007: Supply Cost Breakdown — Hidden Fan-out Cascade + +**GitHub Issue**: [#4 — Add Supply Cost Analysis and Perishable Inventory Tracking](https://github.com/DataRecce/jaffle-shop-simulator/issues/4) + +**Story**: Purchasing Manager requests perishable vs non-perishable supply cost breakdown per order item. A teammate modifies the `order_supplies_summary` CTE in `order_items.sql` to add `is_perishable_supply` to the GROUP BY. + +**Init state (buggy PR)**: +```sql +-- order_items.sql — order_supplies_summary CTE +select + product_id, + is_perishable_supply, + sum(supply_cost) as supply_cost +from supplies +group by 1, 2 +``` + +**The bug**: Adding `is_perishable_supply` to GROUP BY changes the grain from 1 row/product to 2 rows/product (perishable + non-perishable). The downstream `LEFT JOIN` fans out every order_item into 2 rows. This cascades: +- `order_items`: row count approximately doubles +- `orders.order_cost`: UNCHANGED (sum of split costs = original total) +- `orders.count_order_items`: DOUBLED +- `orders.count_food_items`: DOUBLED (dashboard column!) +- `orders.count_drink_items`: DOUBLED (dashboard column!) 
+- `orders.order_items_subtotal`: DOUBLED (sum of duplicated product_price) +- `customers`: UNCHANGED (uses order-level columns, not order_items) + +**What we expect the agent to find**: +- Issue found: **yes** — data drift +- Root cause: grain change in order_supplies_summary fans out the join +- Impacted: `order_items`, `orders` +- Not impacted: `stg_orders`, `customers`, `products`, `supplies` +- Dashboard impact: **yes** (count_food_items, count_drink_items doubled) +- Detection requires: **data comparison** + +**Difficulty**: hard — the grain change looks innocent (adding a dimension), but cascades through orders into dashboard columns + +--- + +## data-008: Numeric Precision Refactor — Zero-Change False Positive Trap + +**GitHub Issue**: [#2 — Add Tax Summary Report and Cost Accounting Breakdown](https://github.com/DataRecce/jaffle-shop-simulator/issues/2) + +**Story**: Data Engineer wraps all three `cents_to_dollars()` calls in `stg_orders.sql` with `round(..., 2)` for "defensive precision." + +**Init state (buggy PR)**: +```sql +-- stg_orders.sql +round({{ cents_to_dollars('subtotal') }}, 2) as subtotal, +round({{ cents_to_dollars('tax_paid') }}, 2) as tax_paid, +round({{ cents_to_dollars('order_total') }}, 2) as order_total, +``` + +**The bug**: There is NO bug. The `cents_to_dollars` macro already casts to `numeric(16, 2)`. Applying `round(x, 2)` to a value that is already `numeric(16, 2)` is a complete no-op — zero rows change, zero values change across the entire DAG. 
+ +**What we expect the agent to find**: +- Issue found: **no** — the change is a no-op +- Root cause: round() on already-rounded numeric is redundant +- Impacted: none +- Not impacted: `stg_orders`, `orders`, `customers`, `order_items`, `products` +- Dashboard impact: **no** +- Detection requires: **data comparison** (to confirm zero change, not just code reasoning) + +**Difficulty**: medium — the agent must resist the trap of reporting impact based on DAG reasoning alone (stg_orders is root → everything downstream "could" be affected) + +--- + +## data-009: Date Truncation Change — Month Grain Collapses Daily Timeline + +**GitHub Issue**: [#9 — Optimize Date Granularity for Monthly Reporting](https://github.com/DataRecce/jaffle-shop-simulator/issues/9) + +**Story**: Analytics Engineer changes `date_trunc` in `stg_orders.sql` from `'day'` to `'month'` to "reduce cardinality and improve query performance." + +**Init state (buggy PR)**: +```sql +-- stg_orders.sql +{{ dbt.date_trunc('month','ordered_at') }} as ordered_at +``` + +**The bug**: `ordered_at` loses daily granularity — all orders in the same month collapse to the 1st of the month. This propagates through the entire DAG: +- `orders.ordered_at` — month-level (dashboard column!) +- `orders.customer_order_number` — ROW_NUMBER by month becomes non-deterministic +- `order_items.ordered_at` — month-level +- `customers.first_ordered_at` / `last_ordered_at` — month-level only + +Financial columns (subtotal, tax_paid, order_total) are completely unchanged. Row counts are identical — impact is purely value-level on date columns. 
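The idempotence can be seen with plain decimal arithmetic — a minimal sketch assuming `cents_to_dollars` is equivalent to dividing by 100 and casting to two decimal places (the Python helper below is an illustrative stand-in, not the dbt macro itself):

```python
from decimal import Decimal, ROUND_HALF_UP

def cents_to_dollars(cents: int) -> Decimal:
    # Stand-in for the macro's cast to numeric(16, 2): divide by 100,
    # quantize to 2 decimal places.
    return (Decimal(cents) / 100).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Applying round(x, 2) to a value already quantized to 2 decimal places
# returns the same value for every input — the PR's change is a no-op.
for cents in (0, 5, 1099, 123_456):
    dollars = cents_to_dollars(cents)
    assert round(dollars, 2) == dollars
```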
+ +**What we expect the agent to find**: +- Issue found: **yes** — data drift +- Root cause: date_trunc changed from day to month, collapsing daily granularity +- Impacted: `stg_orders`, `orders`, `order_items`, `customers` +- Not impacted: `products`, `supplies`, `locations` +- Dashboard impact: **yes** (ordered_at is a dashboard column) +- Detection requires: **data comparison** + +**Difficulty**: medium — the agent must correctly scope impact to date columns only and avoid false positives on financial metrics + +--- + ## Summary Matrix | ID | Bug Type | Modified/New | Difficulty | Detection | Dashboard? | Affected Rows | @@ -247,4 +345,7 @@ where subtotal > 0 | data-004 | Count ratio vs cost ratio | New `supply_analysis` | medium | data comparison | no | all rows | | data-005 | current_date on historical data | New `customer_segments` | easy | data comparison | no | all rows | | data-006 | Tax instead of COGS in formula | New `financial_orders` | easy | data comparison | no | all rows | +| data-007 | Grain fan-out cascades to dashboard | Modified `order_items` | hard | data comparison | yes | all rows (doubled) | +| data-008 | No-op precision change (false positive trap) | Modified `stg_orders` | medium | data comparison | no | 0 | +| data-009 | Date grain collapse (day→month) | Modified `stg_orders` | medium | data comparison | yes | 658,657 | | code-001 | Wrong filter column (spec deviation) | Modified `stg_orders` | hard | code review | no | 4,155 | diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-007-supply-grain-fanout.yaml b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-007-supply-grain-fanout.yaml new file mode 100644 index 0000000..5cca3ba --- /dev/null +++ b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-007-supply-grain-fanout.yaml @@ -0,0 +1,72 @@ +id: data-007-supply-grain-fanout +name: "Supply Cost Breakdown — Hidden Fan-out Cascade" +description: "order_items supply summary adds is_perishable_supply to GROUP BY — 
grain change fans out join, doubling count columns through orders mart into dashboard" +github_issue: https://github.com/DataRecce/jaffle-shop-simulator/issues/4 +layer: review +difficulty: hard +stakeholder: purchasing +case_type: problem_exists + +story: | + The Purchasing Manager (P2) requested a breakdown of perishable vs non-perishable supply + costs per order item, to better understand spoilage risk in the supply chain. + + A teammate modified the `order_supplies_summary` CTE in `order_items.sql` to include + `is_perishable_supply` in the GROUP BY and SELECT. This splits each product's supply cost + into two rows: one for perishable supplies, one for non-perishable supplies. + + The code change looks reasonable — adding a dimension to an aggregation. But it changes + the grain of `order_supplies_summary` from 1 row per product to 2 rows per product. + The downstream LEFT JOIN in the `joined` CTE now produces 2 rows per order_item (one for + each perishable category). This fan-out cascades: + + - `order_items`: row count approximately doubles + - `orders.order_cost`: UNCHANGED (sum of split costs = original total) + - `orders.count_order_items`: DOUBLED (counts duplicated rows) + - `orders.count_food_items`: DOUBLED (dashboard column!) + - `orders.count_drink_items`: DOUBLED (dashboard column!) + - `orders.order_items_subtotal`: DOUBLED (sum of duplicated product_price) + - `customers`: UNCHANGED (aggregates use order-level columns from stg_orders, not order_items) + + The bug is a classic grain mismatch hidden behind an innocent-looking GROUP BY change. 
+ +environment: + repo: DataRecce/jaffle-shop-simulator + ref: eval-base + adapter: duckdb + +setup: + strategy: git_patch + patch_reverse_file: scenarios/v2/patches/data-007-supply-grain-fanout.patch + skip_context: false + +prompt: + template: prompts/review.md + vars: + stakeholder_name: "Purchasing Manager (P2)" + stakeholder_request: "Add perishable vs non-perishable supply cost breakdown per order item for spoilage risk analysis" + pr_description: "Add is_perishable_supply dimension to order_items supply cost aggregation — splits supply_cost into perishable and non-perishable components" + +headless: + max_budget_usd: 5.00 + output_format: json + +ground_truth: + issue_found: true + issue_type: data_drift + root_cause_keywords: ["grain", "fan-out", "group by", "is_perishable_supply", "duplicate", "count", "order_supplies_summary", "double"] + impacted_models: ["order_items", "orders"] + not_impacted_models: ["stg_orders", "customers", "products", "supplies"] + dashboard_impact: true + detection_requires: data_comparison + +judge_criteria: + - "Agent identifies the grain change in order_supplies_summary (1 row/product → 2 rows/product)" + - "Agent recognizes the fan-out cascade: order_items rows doubled → orders count columns doubled" + - "Agent notes that order_cost (sum of supply_cost) is UNCHANGED despite the fan-out — sum of parts equals the original total" + - "Agent identifies that count_food_items and count_drink_items are DOUBLED — these are Executive Dashboard columns" + - "Agent correctly identifies that customers model is NOT impacted" + - "Agent correctly identifies dashboard_impact as true (count_food_items, count_drink_items)" + +teardown: + restore_files: ["models/marts/order_items.sql"] diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-008-precision-noop.yaml b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-008-precision-noop.yaml new file mode 100644 index 0000000..b2195d8 --- /dev/null +++ 
b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-008-precision-noop.yaml @@ -0,0 +1,69 @@ +id: data-008-precision-noop +name: "Numeric Precision Refactor — Zero-Change False Positive Trap" +description: "stg_orders wraps cents_to_dollars with round(x, 2) — macro already outputs numeric(16,2) so data is identical, but code diff touches root staging model" +github_issue: https://github.com/DataRecce/jaffle-shop-simulator/issues/2 +layer: review +difficulty: medium +stakeholder: data-engineering +case_type: no_problem + +story: | + A Data Engineer noticed that the `cents_to_dollars` macro returns `::numeric(16, 2)` but + wanted to make the precision "explicit and defensive" by wrapping all three money columns + in `stg_orders.sql` with `round(..., 2)`. + + The PR description says: "Add explicit rounding to money columns for precision safety — + ensures no floating point drift in downstream aggregations." + + The change modifies `stg_orders.sql`, which is the ROOT staging model feeding into + `orders`, `customers`, and every downstream mart. A code-only reviewer seeing a change + to the root financial staging model would reasonably flag this as high-risk and report + potential impact on all downstream models. + + However, `cents_to_dollars` already casts to `numeric(16, 2)`. Applying `round(x, 2)` to + a value that is already `numeric(16, 2)` is a complete no-op — zero rows change, zero + values change, zero downstream impact. The correct assessment is: no issue found. + + This scenario tests whether the agent can use data comparison to CONFIRM safety rather + than relying on DAG reasoning alone (which would produce false positives). 
+ +environment: + repo: DataRecce/jaffle-shop-simulator + ref: eval-base + adapter: duckdb + +setup: + strategy: git_patch + patch_reverse_file: scenarios/v2/patches/data-008-precision-noop.patch + skip_context: false + +prompt: + template: prompts/review.md + vars: + stakeholder_name: "Data Engineer (P3)" + stakeholder_request: "Add explicit rounding to money columns in stg_orders for precision safety" + pr_description: "Wrap cents_to_dollars output with round(x, 2) in stg_orders — defensive precision for downstream financial aggregations" + +headless: + max_budget_usd: 5.00 + output_format: json + +ground_truth: + issue_found: false + issue_type: no_issue + root_cause_keywords: ["no-op", "round", "numeric", "precision", "already", "identical", "no change", "zero"] + impacted_models: [] + not_impacted_models: ["stg_orders", "orders", "customers", "order_items", "products"] + dashboard_impact: false + detection_requires: data_comparison + +judge_criteria: + - "Agent verifies through DATA comparison that all downstream models have zero value changes" + - "Agent recognizes that round(numeric(16,2), 2) is a no-op — the macro already handles precision" + - "Agent does NOT report false positives on orders, customers, or other downstream models" + - "Agent correctly concludes issue_found: false — no data impact despite code change to root model" + - "Agent correctly identifies dashboard_impact as false" + - "Agent avoids the trap of DAG-based reasoning alone (stg_orders is root → everything must be impacted)" + +teardown: + restore_files: ["models/staging/stg_orders.sql"] diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-009-date-grain-month.yaml b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-009-date-grain-month.yaml new file mode 100644 index 0000000..bfd1796 --- /dev/null +++ b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-009-date-grain-month.yaml @@ -0,0 +1,78 @@ +id: data-009-date-grain-month +name: "Date Truncation Change — Month 
Grain Collapses Daily Timeline" +description: "stg_orders changes date_trunc from day to month — ordered_at loses daily granularity across entire DAG, but financial columns are unchanged" +github_issue: https://github.com/DataRecce/jaffle-shop-simulator/issues/9 +layer: review +difficulty: medium +stakeholder: analytics +case_type: problem_exists + +story: | + An Analytics Engineer proposed changing the date truncation in `stg_orders.sql` from + `day` to `month` to "reduce cardinality and improve query performance for monthly + reporting dashboards." + + The PR modifies one line in `stg_orders.sql`: + - Before: `date_trunc('day', ordered_at)` + - After: `date_trunc('month', ordered_at)` + + The change compiles fine and all dbt tests pass. The PR description argues this is a + harmless optimization since "most reports aggregate to monthly anyway." + + However, `stg_orders` is the ROOT staging model for the entire orders pipeline. The + `ordered_at` column propagates through: + - `orders.ordered_at` — now month-level (dashboard column!) + - `orders.customer_order_number` — ROW_NUMBER ordered by month becomes non-deterministic + for orders within the same month + - `order_items.ordered_at` — joined from stg_orders, now month-level + - `customers.first_ordered_at` — now month-level only (loses day precision) + - `customers.last_ordered_at` — now month-level only (loses day precision) + + Critically, financial columns (subtotal, tax_paid, order_total, order_cost) are + COMPLETELY UNCHANGED. The agent must correctly scope the impact to date/time columns + only and avoid false positives on financial metrics. + + Row counts are identical across all models — no rows added or removed. The impact is + purely in value changes to the ordered_at column and its derivatives. 
+ +environment: + repo: DataRecce/jaffle-shop-simulator + ref: eval-base + adapter: duckdb + +setup: + strategy: git_patch + patch_reverse_file: scenarios/v2/patches/data-009-date-grain-month.patch + skip_context: false + +prompt: + template: prompts/review.md + vars: + stakeholder_name: "Analytics Engineer (P3)" + stakeholder_request: "Optimize date granularity in stg_orders from daily to monthly for reporting performance" + pr_description: "Change date_trunc from day to month in stg_orders — reduces ordered_at cardinality for faster monthly aggregations" + +headless: + max_budget_usd: 5.00 + output_format: json + +ground_truth: + issue_found: true + issue_type: data_drift + root_cause_keywords: ["date_trunc", "month", "day", "ordered_at", "granularity", "precision", "cardinality"] + impacted_models: ["stg_orders", "orders", "order_items", "customers"] + not_impacted_models: ["products", "supplies", "locations"] + dashboard_impact: true + detection_requires: data_comparison + +judge_criteria: + - "Agent identifies that ordered_at loses daily granularity — collapses to month-level across the DAG" + - "Agent correctly identifies dashboard_impact as true (ordered_at is a dashboard column)" + - "Agent correctly identifies that financial columns (subtotal, tax_paid, order_total) are UNCHANGED" + - "Agent correctly scopes impacted_models to those that use ordered_at: stg_orders, orders, order_items, customers" + - "Agent does NOT falsely report products, supplies, or locations as impacted" + - "Agent notes that customer_order_number becomes non-deterministic for same-month orders" + - "Agent recognizes row counts are unchanged — the impact is value-level, not row-level" + +teardown: + restore_files: ["models/staging/stg_orders.sql"] diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-007-supply-grain-fanout.patch b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-007-supply-grain-fanout.patch new file mode 100644 index 
0000000..900c962 --- /dev/null +++ b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-007-supply-grain-fanout.patch @@ -0,0 +1,26 @@ +diff --git a/models/marts/order_items.sql b/models/marts/order_items.sql +--- a/models/marts/order_items.sql ++++ b/models/marts/order_items.sql +@@ -29,13 +29,12 @@ + + select + product_id, +- is_perishable_supply, + + sum(supply_cost) as supply_cost + + from supplies + +- group by 1, 2 ++ group by 1 + + ), + +@@ -51,7 +50,6 @@ + products.is_food_item, + products.is_drink_item, + +- order_supplies_summary.is_perishable_supply, + order_supplies_summary.supply_cost + + from order_items diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-008-precision-noop.patch b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-008-precision-noop.patch new file mode 100644 index 0000000..acb6c63 --- /dev/null +++ b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-008-precision-noop.patch @@ -0,0 +1,16 @@ +diff --git a/models/staging/stg_orders.sql b/models/staging/stg_orders.sql +--- a/models/staging/stg_orders.sql ++++ b/models/staging/stg_orders.sql +@@ -19,9 +19,9 @@ + subtotal as subtotal_cents, + tax_paid as tax_paid_cents, + order_total as order_total_cents, +- round({{ cents_to_dollars('subtotal') }}, 2) as subtotal, +- round({{ cents_to_dollars('tax_paid') }}, 2) as tax_paid, +- round({{ cents_to_dollars('order_total') }}, 2) as order_total, ++ {{ cents_to_dollars('subtotal') }} as subtotal, ++ {{ cents_to_dollars('tax_paid') }} as tax_paid, ++ {{ cents_to_dollars('order_total') }} as order_total, + + ---------- timestamps + {{ dbt.date_trunc('day','ordered_at') }} as ordered_at diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-009-date-grain-month.patch b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-009-date-grain-month.patch new file mode 100644 index 0000000..188f689 --- /dev/null +++ 
b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-009-date-grain-month.patch @@ -0,0 +1,12 @@ +diff --git a/models/staging/stg_orders.sql b/models/staging/stg_orders.sql +--- a/models/staging/stg_orders.sql ++++ b/models/staging/stg_orders.sql +@@ -24,7 +24,7 @@ + {{ cents_to_dollars('order_total') }} as order_total, + + ---------- timestamps +- {{ dbt.date_trunc('month','ordered_at') }} as ordered_at ++ {{ dbt.date_trunc('day','ordered_at') }} as ordered_at + + from source + diff --git a/plugins/recce-dev/skills/recce-eval/scripts/run-batch.sh b/plugins/recce-dev/skills/recce-eval/scripts/run-batch.sh new file mode 100755 index 0000000..ae76b06 --- /dev/null +++ b/plugins/recce-dev/skills/recce-eval/scripts/run-batch.sh @@ -0,0 +1,262 @@ +#!/bin/bash +# run-batch.sh — Batch eval runner: render prompts → interleaved run loop → score +# +# Implements SKILL.md Step 6 (prompt rendering, interleaved run loop, and +# deterministic scoring) as a single background-capable script. +# Caller (SKILL.md) handles Steps 1-5 (parse scenarios, bootstrap project, +# detect adapter, create batch dir, generate MCP config) and Steps 7-11 +# (judge, meta, report).
+# +# Usage: +# bash run-batch.sh \ +# --scenarios scenario1.yaml,scenario2.yaml \ +# --batch-dir /path/to/batch \ +# --eval-id 20260404-1530 \ +# --skill-dir /path/to/recce-eval \ +# --recce-plugin /path/to/recce-plugin \ +# --target dev \ +# --adapter-desc "DuckDB (local file database, target: dev)" \ +# [--mcp-config /tmp/mcp.json] \ +# [-n 3] [--model claude-sonnet-4-20250514] [--mode real-world] \ +# [--no-bare] [--project-dir /path/to/project] +# +# Output: +# - Per-run JSONs: $BATCH_DIR/<scenario-id>/<variant>_run<N>.json (via run-case.sh) +# - Deterministic scores merged into per-run JSONs (via score-deterministic.sh) +# - Batch summary: $BATCH_DIR/batch-summary.json +# - Progress lines to stdout +set -euo pipefail + +# ========== Argument Parsing ========== +SCENARIOS="" BATCH_DIR="" EVAL_ID="" SKILL_DIR="" RECCE_PLUGIN="" +TARGET="" ADAPTER_DESC="" MCP_CONFIG="" RUNS=1 MODEL="" MODE="real-world" +NO_BARE="" PROJECT_DIR="" + +while [[ $# -gt 0 ]]; do + case "$1" in + --scenarios) SCENARIOS="$2"; shift 2 ;; + --batch-dir) BATCH_DIR="$2"; shift 2 ;; + --eval-id) EVAL_ID="$2"; shift 2 ;; + --skill-dir) SKILL_DIR="$2"; shift 2 ;; + --recce-plugin) RECCE_PLUGIN="$2"; shift 2 ;; + --target) TARGET="$2"; shift 2 ;; + --adapter-desc) ADAPTER_DESC="$2"; shift 2 ;; + --mcp-config) MCP_CONFIG="$2"; shift 2 ;; + -n|--runs) RUNS="$2"; shift 2 ;; + --model) MODEL="$2"; shift 2 ;; + --mode) MODE="$2"; shift 2 ;; + --no-bare) NO_BARE="true"; shift 1 ;; + --project-dir) PROJECT_DIR="$2"; shift 2 ;; + *) echo "ERROR: Unknown argument: $1" >&2; exit 1 ;; + esac +done + +# ========== Validation ========== +MISSING="" +[ -z "$SCENARIOS" ] && MISSING="$MISSING --scenarios" +[ -z "$BATCH_DIR" ] && MISSING="$MISSING --batch-dir" +[ -z "$EVAL_ID" ] && MISSING="$MISSING --eval-id" +[ -z "$SKILL_DIR" ] && MISSING="$MISSING --skill-dir" +[ -z "$RECCE_PLUGIN" ] && MISSING="$MISSING --recce-plugin" +[ -z "$TARGET" ] && MISSING="$MISSING --target" +[ -z "$ADAPTER_DESC" ] && MISSING="$MISSING --adapter-desc" +
+if [ -n "$MISSING" ]; then + echo "ERROR: Missing required arguments:$MISSING" >&2 + exit 1 +fi + +for cmd in yq jq python3; do + if ! command -v "$cmd" &>/dev/null; then + echo "ERROR: Required command not found: $cmd" >&2 + exit 1 + fi +done + +IFS=',' read -ra SCENARIO_FILES_RAW <<< "$SCENARIOS" +SCENARIO_FILES=() +for f in "${SCENARIO_FILES_RAW[@]}"; do + f="${f#"${f%%[![:space:]]*}"}" # trim leading whitespace + f="${f%"${f##*[![:space:]]}"}" # trim trailing whitespace + if [ ! -f "$f" ]; then + echo "ERROR: Scenario file not found: $f" >&2 + exit 1 + fi + SCENARIO_FILES+=("$f") +done + +mkdir -p "$BATCH_DIR" + +# ========== Phase 1: Render Prompts ========== +echo "=== Phase 1: Rendering prompts for ${#SCENARIO_FILES[@]} scenarios ===" + +for SCENARIO_FILE in "${SCENARIO_FILES[@]}"; do + SCENARIO_ID=$(yq -r '.id' "$SCENARIO_FILE") + TEMPLATE=$(yq -r '.prompt.template // ""' "$SCENARIO_FILE") + PROMPT_FILE="/tmp/recce-eval-prompt-${EVAL_ID}-${SCENARIO_ID}.txt" + + if [ -n "$TEMPLATE" ]; then + # v2: template + vars substituted by render-prompt.py + python3 "${SKILL_DIR}/scripts/render-prompt.py" \ + "${SKILL_DIR}/${TEMPLATE}" "$SCENARIO_FILE" \ + --var "adapter_description=${ADAPTER_DESC}" \ + --var "target=${TARGET}" \ + > "$PROMPT_FILE" + else + # v1: inline prompt with runtime variable substitution + PROMPT_TEXT=$(yq -r '.prompt' "$SCENARIO_FILE") + PROMPT_TEXT="${PROMPT_TEXT//\{adapter_description\}/$ADAPTER_DESC}" + PROMPT_TEXT="${PROMPT_TEXT//\{target\}/$TARGET}" + printf '%s' "$PROMPT_TEXT" > "$PROMPT_FILE" + fi + + echo " [ok] $SCENARIO_ID" +done + +# ========== Phase 2: Interleaved Run Loop ========== +# Order: for each run_num → for each scenario → baseline then with-plugin. +# Interleaving reduces systematic bias from cache warming or temporal effects. 
+TOTAL_RUNS=$(( ${#SCENARIO_FILES[@]} * RUNS * 2 )) +echo "" +echo "=== Phase 2: Running $TOTAL_RUNS cases (${#SCENARIO_FILES[@]} scenarios x $RUNS runs x 2 variants) ===" +echo "" + +RUN_INDEX=0 +SUCCEEDED=0 +FAILED=0 +BATCH_START=$(date +%s) + +for (( run_num=1; run_num<=RUNS; run_num++ )); do + for SCENARIO_FILE in "${SCENARIO_FILES[@]}"; do + # Parse scenario metadata once per scenario per run_num + SCENARIO_ID=$(yq -r '.id' "$SCENARIO_FILE") + CASE_TYPE=$(yq -r '.case_type' "$SCENARIO_FILE") + SETUP_STRATEGY=$(yq -r '.setup.strategy' "$SCENARIO_FILE") + PATCH_REL=$(yq -r '.setup.patch_reverse_file // ""' "$SCENARIO_FILE") + SKIP_CTX=$(yq -r '.setup.skip_context // "false"' "$SCENARIO_FILE") + RESTORE=$(yq -r '.teardown.restore_files // [] | join(",")' "$SCENARIO_FILE") + MAX_BUDGET=$(yq -r '.headless.max_budget_usd' "$SCENARIO_FILE") + GT_JSON=$(yq -o=json '.ground_truth' "$SCENARIO_FILE" | jq -c .) + PROMPT_FILE="/tmp/recce-eval-prompt-${EVAL_ID}-${SCENARIO_ID}.txt" + + mkdir -p "$BATCH_DIR/$SCENARIO_ID" + + for VARIANT in baseline with-plugin; do + RUN_INDEX=$(( RUN_INDEX + 1 )) + CASE_START=$(date +%s) + echo "[${RUN_INDEX}/${TOTAL_RUNS}] ${SCENARIO_ID} ${VARIANT} run${run_num} — starting" + + # Build run-case.sh argument list + RUN_ARGS=( + --id "$SCENARIO_ID" + --case-type "$CASE_TYPE" + --variant "$VARIANT" + --prompt-file "$PROMPT_FILE" + --setup-strategy "$SETUP_STRATEGY" + --target "$TARGET" + --max-budget-usd "$MAX_BUDGET" + --output-dir "$BATCH_DIR/$SCENARIO_ID" + --run-number "$run_num" + ) + + # Isolation mode: --bare (default) or --no-bare + if [ -z "$NO_BARE" ]; then + RUN_ARGS+=(--bare) + else + RUN_ARGS+=(--no-bare --no-clean-profile) + fi + + # Patch file (only for git_patch strategy) + if [ "$SETUP_STRATEGY" = "git_patch" ] && [ -n "$PATCH_REL" ] && [ "$PATCH_REL" != "null" ]; then + RUN_ARGS+=(--patch-file "${SKILL_DIR}/${PATCH_REL}") + fi + [ -n "$RESTORE" ] && RUN_ARGS+=(--restore-files "$RESTORE") + + # With-plugin variant: inject plugin 
+ MCP + if [ "$VARIANT" = "with-plugin" ]; then + RUN_ARGS+=(--plugin-dir "$RECCE_PLUGIN") + [ -n "$MCP_CONFIG" ] && RUN_ARGS+=(--mcp-config "$MCP_CONFIG") + fi + + # Optional flags + [ -n "$MODEL" ] && RUN_ARGS+=(--model "$MODEL") + RUN_ARGS+=(--mode "$MODE") + [ -n "$PROJECT_DIR" ] && RUN_ARGS+=(--project-dir "$PROJECT_DIR") + [ "$SKIP_CTX" = "true" ] && RUN_ARGS+=(--skip-setup-context) + + # Execute run-case.sh + RUN_FILE="$BATCH_DIR/$SCENARIO_ID/${VARIANT}_run${run_num}.json" + RUN_OUTPUT="" + if RUN_OUTPUT=$(bash "${SKILL_DIR}/scripts/run-case.sh" "${RUN_ARGS[@]}" 2>&1); then + # Parse KEY=VALUE output from run-case.sh + COST=$(echo "$RUN_OUTPUT" | grep "^TOTAL_COST_USD=" | cut -d= -f2 || echo "?") + DURATION=$(echo "$RUN_OUTPUT" | grep "^DURATION_MS=" | cut -d= -f2 || echo "0") + JSON_OK=$(echo "$RUN_OUTPUT" | grep "^JSON_EXTRACTED=" | cut -d= -f2 || echo "?") + + # Score immediately after each run + SCORE_OUTPUT="" + if SCORE_OUTPUT=$(bash "${SKILL_DIR}/scripts/score-deterministic.sh" \ + --run-file "$RUN_FILE" \ + --case-type "$CASE_TYPE" \ + --ground-truth "$GT_JSON" 2>&1); then + PASS_RATE=$(echo "$SCORE_OUTPUT" | grep "^PASS_RATE=" | cut -d= -f2 || echo "?") + else + PASS_RATE="score-error" + fi + + DURATION_SEC=$(( ${DURATION:-0} / 1000 )) + echo "[${RUN_INDEX}/${TOTAL_RUNS}] ${SCENARIO_ID} ${VARIANT} run${run_num} — DONE cost=\$${COST} duration=${DURATION_SEC}s json=${JSON_OK} pass_rate=${PASS_RATE}" + SUCCEEDED=$(( SUCCEEDED + 1 )) + else + CASE_END=$(date +%s) + WALL_SEC=$(( CASE_END - CASE_START )) + echo "[${RUN_INDEX}/${TOTAL_RUNS}] ${SCENARIO_ID} ${VARIANT} run${run_num} — FAILED after ${WALL_SEC}s" + echo "$RUN_OUTPUT" | tail -3 | sed 's/^/ > /' + FAILED=$(( FAILED + 1 )) + fi + done + done +done + +# ========== Phase 3: Summary ========== +BATCH_END=$(date +%s) +BATCH_DURATION=$(( BATCH_END - BATCH_START )) +BATCH_MINUTES=$(( BATCH_DURATION / 60 )) +BATCH_SECONDS=$(( BATCH_DURATION % 60 )) + +echo "" +echo "=== BATCH COMPLETE ===" +echo "Eval 
ID: $EVAL_ID" +echo "Succeeded: $SUCCEEDED / $TOTAL_RUNS" +echo "Failed: $FAILED / $TOTAL_RUNS" +echo "Duration: ${BATCH_MINUTES}m ${BATCH_SECONDS}s" +echo "Output: $BATCH_DIR" + +# Write machine-readable summary for SKILL.md Steps 7-11 +SCENARIO_IDS="[]" +for SCENARIO_FILE in "${SCENARIO_FILES[@]}"; do + SID=$(yq -r '.id' "$SCENARIO_FILE") + SCENARIO_IDS=$(echo "$SCENARIO_IDS" | jq --arg id "$SID" '. + [$id]') +done + +jq -n \ + --arg eval_id "$EVAL_ID" \ + --argjson total "$TOTAL_RUNS" \ + --argjson succeeded "$SUCCEEDED" \ + --argjson failed "$FAILED" \ + --argjson runs "$RUNS" \ + --argjson scenarios "$SCENARIO_IDS" \ + --arg batch_dir "$BATCH_DIR" \ + --arg completed_at "$(date -u +"%Y-%m-%dT%H:%M:%SZ")" \ + --argjson duration_sec "$BATCH_DURATION" \ + '{ + eval_id: $eval_id, + total_runs: $total, + succeeded: $succeeded, + failed: $failed, + runs_per_scenario: $runs, + scenarios: $scenarios, + batch_dir: $batch_dir, + completed_at: $completed_at, + duration_sec: $duration_sec + }' > "$BATCH_DIR/batch-summary.json" + +echo "Summary: $BATCH_DIR/batch-summary.json" diff --git a/plugins/recce-dev/skills/recce-eval/scripts/run-case.sh b/plugins/recce-dev/skills/recce-eval/scripts/run-case.sh index 6b0047d..9aeab4d 100755 --- a/plugins/recce-dev/skills/recce-eval/scripts/run-case.sh +++ b/plugins/recce-dev/skills/recce-eval/scripts/run-case.sh @@ -153,13 +153,12 @@ cleanup() { f="${f#"${f%%[![:space:]]*}"}" # trim leading whitespace f="${f%"${f##*[![:space:]]}"}" # trim trailing whitespace if [ -n "$f" ] && git rev-parse --is-inside-work-tree &>/dev/null 2>&1; then - # For tracked files, unstage then restore from git. - # For untracked files (created by reverse-applying "deleted file" - # patches), remove them.
- if git ls-files --error-unmatch "$f" &>/dev/null 2>&1; then - git restore --staged -- "$f" 2>/dev/null || true + git restore --staged -- "$f" 2>/dev/null || true + if git show HEAD:"$f" &>/dev/null 2>&1; then + # File exists in HEAD — restore to HEAD state git checkout -- "$f" 2>/dev/null || true else + # File doesn't exist in HEAD (created by patch) — remove rm -f "$f" 2>/dev/null || true fi fi @@ -180,21 +179,47 @@ if [ "$DRY_RUN" = "false" ] && [ "$SKIP_SETUP" = "false" ]; then echo "ERROR: Patch file not found: $PATCH_FILE" >&2 exit 1 fi - # Build base state in a SEPARATE schema so Recce can compare data. - # DuckDB uses one file with multiple schemas. Without a separate base - # schema, value_diff compares dev against itself → 0 differences. - # 1. Build clean state in both dev (current) and prod (base) schemas - # 2. Capture base artifacts from prod - # 3. Apply patch and rebuild dev only - # 4. Recce compares dev (buggy) vs prod (clean) for actual data diffs + # Build base state only for with-plugin variant. + # Baseline gets NO comparison target — it must reason from code + + # single-schema data alone. This mirrors reality: without Recce, + # a developer has no pre-built before/after comparison. + # With-plugin gets both prod (clean) and dev (buggy) schemas so + # Recce MCP tools (value_diff, profile_diff) can compare data. BASE_TARGET="prod" - dbt run --target "$BASE_TARGET" --full-refresh --quiet - dbt docs generate --target-path target-base --target "$BASE_TARGET" --quiet 2>/dev/null || true + if [ "$VARIANT" = "baseline" ]; then + # Drop stale prod schema from prior with-plugin runs. + # In batch mode, interleaved execution (baseline → with-plugin + # per scenario) leaves prod schema in the shared DuckDB file. + # Without this cleanup, baseline gets a free comparison target. 
+ python3 -c " +import os, duckdb +db_path = os.environ.get('JAFFLE_SHOP_DB_PATH', 'data/jaffel-shop.duckdb') +db = duckdb.connect(db_path) +db.execute('DROP SCHEMA IF EXISTS prod CASCADE') +db.close() +" 2>/dev/null || true + fi + if [ "$VARIANT" = "with-plugin" ]; then + dbt run --target "$BASE_TARGET" --full-refresh --quiet + dbt docs generate --target-path target-base --target "$BASE_TARGET" --quiet 2>/dev/null || true + fi # Now apply patch (introduces the bug) and rebuild current state. # Use --full-refresh so incremental models reprocess ALL rows with # the buggy code — otherwise value_diff sees 0 changed rows because # the stored data was computed before the patch was applied. - git apply --reverse --3way "$PATCH_FILE" + # Try --3way first (handles whitespace mismatches), fall back to + # plain apply only for patches that create new files (no base in index). + GIT_APPLY_STDERR=$(mktemp) + if git apply --reverse --3way "$PATCH_FILE" 2>"$GIT_APPLY_STDERR"; then + rm -f "$GIT_APPLY_STDERR" + elif grep -q "does not exist in index" "$GIT_APPLY_STDERR"; then + rm -f "$GIT_APPLY_STDERR" + git apply --reverse "$PATCH_FILE" + else + cat "$GIT_APPLY_STDERR" >&2 + rm -f "$GIT_APPLY_STDERR" + exit 1 + fi dbt run --target "$TARGET" --full-refresh --quiet dbt docs generate --target "$TARGET" --quiet 2>/dev/null || true # Run dbt test BEFORE MCP starts (avoids DuckDB lock conflict). 
diff --git a/plugins/recce/agents/recce-reviewer.md b/plugins/recce/agents/recce-reviewer.md index 8645974..3e654c5 100644 --- a/plugins/recce/agents/recce-reviewer.md +++ b/plugins/recce/agents/recce-reviewer.md @@ -73,13 +73,15 @@ This single call returns: **Interpret `data_impact` for each model:** - `confirmed`: value_diff verified actual data changes — prioritize for root cause investigation - `none`: value_diff verified NO data changes — safe, note briefly in summary -- `null` (or absent): couldn't run value_diff (views, no PK) — unknown, use profile_diff to assess +- `potential`: value_diff was skipped (views, downstream models, no PK) — **MUST follow up** using the model's `next_action` before classifying as impacted or not_impacted. Do NOT put `potential` models in `not_impacted` without investigation. If `impacted_models` is empty: output the "No impact detected" summary (see Section 4) and STOP. ### Step 2 — Follow-up Investigation -For each entry in `suggested_deep_dives`: +**Priority order**: Models with `data_impact: potential` and a `next_action` field take priority over `suggested_deep_dives`. Follow every `next_action` — these are models where impact is unknown and classification depends on your investigation. + +Then, for remaining entries in `suggested_deep_dives`: **2a. Value diff** — If `value_diff` in impact_analysis shows `rows_changed > 0` or the suggestion mentions value changes, call: ``` @@ -95,7 +97,7 @@ This gives distributions (min, max, mean, nulls, distinct counts) that reveal th - If `columns` is null in the suggestion: call `profile_diff` on the whole model (omit `columns` parameter). - On any MCP error: record "tool skipped for {model}: {error reason}" and continue. -- Limit to the first 3 suggested deep dives to control cost. +- Always follow ALL `next_action` items from `potential` models. For additional `suggested_deep_dives` beyond that, limit to 3 to control cost. ### Step 3 — Root Cause Diagnosis
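The Step 2 prioritization above (every `next_action` from a `potential` model first, then at most 3 remaining `suggested_deep_dives`) can be sketched with `jq`, which the eval scripts already require. The payload shape here (`name`, `model`, `reason` fields) is an assumption for illustration, not the documented impact_analysis schema:

```shell
# Hedged sketch: build the follow-up queue from a hypothetical payload.
set -euo pipefail
payload='{
  "impacted_models": [
    {"name": "orders", "data_impact": "confirmed"},
    {"name": "customers", "data_impact": "potential",
     "next_action": "profile_diff on first_ordered_at"}
  ],
  "suggested_deep_dives": [
    {"model": "orders", "reason": "rows_changed > 0"},
    {"model": "order_items", "reason": "joined column drift"}
  ]
}'
queue=$(jq -r '
  # Mandatory follow-ups: potential models with a next_action
  ([.impacted_models[]
    | select(.data_impact == "potential" and .next_action != null)
    | "next_action \(.name) \(.next_action)"]) as $must
  | ([.impacted_models[]
      | select(.data_impact == "potential" and .next_action != null)
      | .name]) as $covered
  # Then at most 3 deep dives not already covered above
  | ($must
     + ([.suggested_deep_dives[]
         | select(.model as $m | $covered | index($m) | not)
         | "deep_dive \(.model) \(.reason)"][:3]))
  | .[]' <<< "$payload")
echo "$queue"
```

Note the design choice this mirrors: only the deep-dive tail is capped for cost; `next_action` items are never dropped, since their models cannot be classified without investigation.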