diff --git a/plugins/recce-dev/skills/recce-eval/SKILL.md b/plugins/recce-dev/skills/recce-eval/SKILL.md index fc21c18..2d1fcad 100644 --- a/plugins/recce-dev/skills/recce-eval/SKILL.md +++ b/plugins/recce-dev/skills/recce-eval/SKILL.md @@ -127,7 +127,7 @@ If the user selects nothing (cancels), **STOP**. ## Run Flow -This is the core orchestration — 12 steps that set up scenarios, run headless Claude Code, score results, and produce a report. +This is the core orchestration — 11 steps that set up scenarios, run headless Claude Code, score results, and produce a report. ### Step 1: Read Scenario(s) @@ -175,7 +175,7 @@ yq -o=json '{ }' "/.yaml" ``` -When `prompt_template` is non-null (v2), read the template file and substitute vars in Step 5. When `prompt_inline` is non-null (v1), use it directly as the prompt text. +When `prompt_template` is non-null (v2), `run-batch.sh` uses `render-prompt.py` with template+vars. When `prompt_inline` is non-null (v1), it substitutes runtime variables directly. ### Step 1b: Clone & Bootstrap v2 Project (v2 only) @@ -195,9 +195,9 @@ eval "$(bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/setup-v2-project.sh echo "PROJECT_DIR=$PROJECT_DIR" ``` -Record `PROJECT_DIR` — pass it as `--project-dir "$PROJECT_DIR"` to all `run-case.sh` invocations in Step 7. +Record `PROJECT_DIR` — pass it as `--project-dir "$PROJECT_DIR"` to `run-batch.sh` in Step 6. -**Cleanup**: At the very end of the Run Flow (after Step 12), remove the temp project: +**Cleanup**: At the very end of the Run Flow (after Step 11), remove the temp project: ```bash if [ -n "$WORK_DIR" ] && [[ "$WORK_DIR" == "${TMPDIR:-/tmp}"* ]]; then @@ -263,23 +263,7 @@ echo "BATCH_DIR=$BATCH_DIR" Record `EVAL_ID` and `BATCH_DIR` for later steps. `BATCH_DIR` is always absolute and anchored to the invoking CWD, so eval output survives v2 temp project cleanup. -### Step 5: Prepare Prompt - -Build the prompt text for each scenario, then write to a temp file. 
- -**v2 (template+vars):** Read the template file from `prompt_template` (relative to `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/`), then substitute `{variables}` with values from `prompt_vars` and runtime values (`{target}`, `{adapter_description}`). - -**v1 (inline prompt):** Use the `prompt_inline` string directly, substituting only runtime values (`{target}`, `{adapter_description}`). - -```bash -PROMPT_FILE="/tmp/recce-eval-prompt-${EVAL_ID}-${SCENARIO_ID}.txt" -cat > "$PROMPT_FILE" << 'PROMPT_EOF' - -PROMPT_EOF -echo "PROMPT_FILE=$PROMPT_FILE" -``` - -### Step 6: Generate Eval MCP Config +### Step 5: Generate Eval MCP Config Create a temporary MCP config JSON using **stdio** transport for Recce MCP. This avoids DuckDB lock conflicts — claude spawns the MCP server as a child process after run-case.sh setup completes, so `dbt run` in setup never competes for the database lock. @@ -307,98 +291,53 @@ echo "MCP_CONFIG=/tmp/recce-eval-mcp-config.json" **Why `--strict-mcp-config`**: The `--mcp-config` flag is additive and its merge behavior with plugin `.mcp.json` for same-name keys is undocumented. Using `--strict-mcp-config` guarantees the eval config is the sole MCP source. -### Step 7: Interleaved Run Loop +### Step 6: Run Eval Batch -Set `NO_BARE` based on whether the user passed `--no-bare`: -- If `--no-bare` was passed: `NO_BARE=true` (passes `--no-bare --no-clean-profile` to `run-case.sh`, uses OAuth auth) -- Otherwise: `NO_BARE=""` (default `--bare` mode, requires `ANTHROPIC_API_KEY`) +Run all scenarios with both variants using `run-batch.sh`. This script encapsulates prompt rendering, the interleaved run loop, and deterministic scoring into a single background-capable command. -Run each scenario with both variants in interleaved order. For N runs, the execution order is: baseline run1 → with-plugin run1 → baseline run2 → with-plugin run2 → ... This reduces systematic bias from cache warming or temporal effects. 
+Build a comma-separated list of absolute scenario file paths from the scenarios parsed in Step 1. Determine the `ADAPTER_DESC` string from the adapter detected in Step 2: -For each run number (1 to N), for each variant (`baseline` first, then `with-plugin`): +| Adapter | `ADAPTER_DESC` | +|---------|---------------| +| duckdb | `DuckDB (local file database, target: $TARGET)` | +| snowflake | `Snowflake (cloud data warehouse, target: $TARGET)` | ```bash -# Create scenario output dir -mkdir -p "$BATCH_DIR/$SCENARIO_ID" - -# ---- Baseline variant ---- -# --bare is default: no memory, no CLAUDE.md, pure prompt-driven evaluation -# When user passes --no-bare: add --no-bare --no-clean-profile (uses OAuth, no API key needed) -bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/run-case.sh \ - --id "$SCENARIO_ID" \ - --case-type "$CASE_TYPE" \ - --variant baseline \ - --prompt-file "$PROMPT_FILE" \ - --setup-strategy "$SETUP_STRATEGY" \ - --patch-file "${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/$PATCH_FILE" \ - --restore-files "$RESTORE_FILES" \ +bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/run-batch.sh \ + --scenarios "$SCENARIO_LIST" \ + --batch-dir "$BATCH_DIR" \ + --eval-id "$EVAL_ID" \ + --skill-dir "${CLAUDE_PLUGIN_ROOT}/skills/recce-eval" \ + --recce-plugin "$RECCE_PLUGIN_ROOT" \ --target "$TARGET" \ - --max-budget-usd "$MAX_BUDGET" \ - --output-dir "$BATCH_DIR/$SCENARIO_ID" \ - --run-number "$RUN_NUM" \ - ${NO_BARE:+--no-bare --no-clean-profile} \ - ${PROJECT_DIR:+--project-dir "$PROJECT_DIR"} -``` - -Parse the KEY=VALUE output from `run-case.sh`. Record `OUTPUT_FILE`, `JSON_EXTRACTED`, `TOTAL_COST_USD`, `DURATION_MS`. 
- -Immediately score the baseline run: - -```bash -bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/score-deterministic.sh \ - --run-file "$BATCH_DIR/$SCENARIO_ID/baseline_run${RUN_NUM}.json" \ - --case-type "$CASE_TYPE" \ - --ground-truth '$GROUND_TRUTH_JSON' -``` - -Then run the with-plugin variant: - -```bash -# ---- With-plugin variant ---- -# --bare is default; --plugin-dir injects the plugin even in bare mode -# When user passes --no-bare: add --no-bare --no-clean-profile (uses OAuth, no API key needed) -bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/run-case.sh \ - --id "$SCENARIO_ID" \ - --case-type "$CASE_TYPE" \ - --variant with-plugin \ - --prompt-file "$PROMPT_FILE" \ - --setup-strategy "$SETUP_STRATEGY" \ - --patch-file "${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/$PATCH_FILE" \ - --restore-files "$RESTORE_FILES" \ - --target "$TARGET" \ - --max-budget-usd "$MAX_BUDGET" \ - --output-dir "$BATCH_DIR/$SCENARIO_ID" \ - --plugin-dir "$RECCE_PLUGIN_ROOT" \ + --adapter-desc "$ADAPTER_DESC" \ --mcp-config /tmp/recce-eval-mcp-config.json \ - --run-number "$RUN_NUM" \ - ${NO_BARE:+--no-bare --no-clean-profile} \ + -n $N \ + ${MODEL:+--model "$MODEL"} \ + ${NO_BARE:+--no-bare} \ ${PROJECT_DIR:+--project-dir "$PROJECT_DIR"} ``` -Score the with-plugin run: +Where `$SCENARIO_LIST` is a comma-separated list of absolute paths to scenario YAML files (e.g., `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scenarios/v2/data-001-double-tax-deduction.yaml,${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scenarios/v2/data-002-cogs-food-only.yaml,...`). -```bash -bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/score-deterministic.sh \ - --run-file "$BATCH_DIR/$SCENARIO_ID/with-plugin_run${RUN_NUM}.json" \ - --case-type "$CASE_TYPE" \ - --ground-truth '$GROUND_TRUTH_JSON' -``` +**What `run-batch.sh` does internally:** +1. **Renders prompts** for each scenario (v2: `render-prompt.py` with template+vars; v1: inline with variable substitution) +2. 
**Runs interleaved loop**: for each run number (1 to N), for each scenario, baseline → score → with-plugin → score. This interleaving reduces systematic bias from cache warming or temporal effects. +3. **Scores each run** immediately with `score-deterministic.sh` +4. **Writes `batch-summary.json`** with run counts, timing, and scenario list -**Important**: The `--ground-truth` value must be a valid JSON string. Extract the `ground_truth` object from the scenario YAML and pass it as a single-quoted JSON string. Example: - -```bash ---ground-truth '{"issue_found":true,"root_cause_keywords":["null","left join","coalesce"],"impacted_models":["orders","orders_daily_summary"],"not_impacted_models":["customers","customer_segments","customer_order_pattern"],"affected_row_count":1584,"all_tests_pass":true}' -``` +**Output files** (all in `$BATCH_DIR`): +- `/baseline_run.json` — per-run JSONs with deterministic scores merged in +- `/with-plugin_run.json` — per-run JSONs with deterministic scores merged in +- `batch-summary.json` — machine-readable batch metadata (succeeded/failed counts, duration, scenario list) -**Handling setup.strategy**: When calling `run-case.sh`: -- If `setup.strategy` is `git_patch`, pass `--patch-file` pointing to `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/` and `--restore-files` as a comma-separated list from `teardown.restore_files`. -- If `setup.strategy` is `none`, pass `--setup-strategy none`. Omit `--patch-file` and `--restore-files`. +**Running in background**: This command can be run via the Bash tool's `run_in_background` parameter for long batches. When complete, proceed to Step 7. -**Error handling**: If `run-case.sh` fails (non-zero exit), log the error and continue to the next run. The teardown trap inside `run-case.sh` handles file restoration automatically. Do not add separate teardown calls here. +**Error handling**: If a single `run-case.sh` invocation fails, the batch continues to the next run. 
Failed runs are counted in `batch-summary.json`. The teardown trap inside `run-case.sh` handles file restoration automatically. -Report progress to the user after each run completes: "Run {N} {variant} complete: cost=${cost}, duration=${duration}s, json_extracted={yes/no}". +**Isolation mode**: `--bare` is the default (both variants get identical isolation). `--no-bare` uses OAuth auth with no API key needed. See Isolation Modes section for details. -### Step 8: Dispatch LLM Judge +### Step 7: Dispatch LLM Judge Use the Agent tool to dispatch `recce-dev:eval-judge` with a prompt that includes all the information the judge needs. Group runs by scenario so the judge can compare variants: @@ -426,7 +365,7 @@ If running multiple scenarios, dispatch the judge once per scenario (not once pe **Error handling**: If the judge agent fails or returns invalid JSON, continue without judge scores. The report will note "LLM judge: unavailable" for affected runs. -### Step 9: Merge Judge Scores +### Step 8: Merge Judge Scores Parse the judge's JSON output. For each run entry in the judge's `runs` array, read the corresponding per-run JSON file and merge `scores.llm_judge` into it: @@ -453,7 +392,7 @@ The judge returns scores per run in the format: Write `comparison_notes` to each run's `scores.llm_judge.comparison_notes` as well. -### Step 10: Write meta.json +### Step 9: Write meta.json Write batch metadata to the batch directory: @@ -476,7 +415,7 @@ EOF Where `$SCENARIOS_JSON_ARRAY` is a JSON array of scenario IDs (e.g., `["data-001-double-tax-deduction", "data-002-cogs-food-only"]`), and `$CLAUDE_MODEL` is from `--model` flag or the current session's model. -### Step 11: Generate Report +### Step 10: Generate Report Read all per-run JSONs in the batch directory (now containing both deterministic and judge scores). Follow the structure defined in `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/references/report-template.md`. @@ -498,7 +437,7 @@ The report includes: 4. 
**Detailed Scores** with per-run deterministic checks and judge scores
5. **Cross-Eval Comparison** with historical deltas (if available)

-### Step 12: Update History and Print Summary
+### Step 11: Update History and Print Summary

Append a summary entry to `.claude/recce-eval/history.json`:

@@ -603,27 +542,27 @@ bash run-case.sh --id ch3-phantom-filter --variant with-plugin \

## Common Mistakes

-- **Shell variables do not persist**: Each Bash tool invocation starts a fresh shell. Re-derive `EVAL_ID`, `BATCH_DIR`, `TARGET`, `ADAPTER`, `RECCE_PLUGIN_ROOT`, and other state in every Bash block that needs them. Do not assume a previous Bash call's variables are available.
+- **Shell variables do not persist**: Each Bash tool invocation starts a fresh shell. Re-derive `EVAL_ID`, `BATCH_DIR`, `TARGET`, `ADAPTER`, `RECCE_PLUGIN_ROOT`, and other state in every Bash block that needs them. Do not assume a previous Bash call's variables are available. Note: `run-batch.sh` eliminates this problem for the run loop (Step 6), but Steps 1-5 and 7-11 still run as separate Bash calls.

- **Forgetting `eval`**: Running `bash resolve-recce-root.sh` without `eval "$(...)"` does not set `RECCE_PLUGIN_ROOT` in the current shell.

- **Platform-specific `md5`**: macOS uses `md5`, Linux uses `md5sum`. The eval scripts handle both — do not simplify to one.

-- **MCP config uses `--strict-mcp-config`**: The eval config must be the sole MCP source. `run-case.sh` passes `--strict-mcp-config --mcp-config` so the eval port is guaranteed. The eval config in Step 6 must include both `recce` (eval port) and `recce-docs` (from `$RECCE_PLUGIN_ROOT`).
+- **MCP config uses `--strict-mcp-config`**: The eval config must be the sole MCP source. `run-case.sh` passes `--strict-mcp-config --mcp-config` so the eval port is guaranteed. The eval config in Step 5 must include both `recce` (eval port) and `recce-docs` (from `$RECCE_PLUGIN_ROOT`). 
- **`--mcp-config` is variadic**: `--mcp-config ` consumes subsequent positional arguments. The `--` separator before the prompt in `run-case.sh` prevents the prompt from being parsed as a config argument. Do not remove it. -- **Interleaved order matters**: Run baseline then with-plugin for the same run number before moving to the next run number. Do not group all baselines then all with-plugins — this introduces systematic bias. +- **Interleaved order matters**: `run-batch.sh` handles this automatically — baseline then with-plugin for each run number. If running manually without `run-batch.sh`, do not group all baselines then all with-plugins — this introduces systematic bias. -- **Teardown is trap-based in run-case.sh**: The script restores files even if `claude -p` fails. Do not add separate teardown calls in the SKILL.md orchestration. +- **Teardown is trap-based in run-case.sh**: The script restores files even if `claude -p` fails. Do not add separate teardown calls in the orchestration. -- **Ground truth as JSON string**: When passing `--ground-truth` to `score-deterministic.sh`, the value must be a valid JSON string. Use single quotes around the entire JSON value in bash to prevent shell expansion. +- **Ground truth as JSON string**: `run-batch.sh` handles this automatically via `yq -o=json | jq -c`. If running `score-deterministic.sh` manually, the `--ground-truth` value must be a valid JSON string. - **Adapter detection uses `yq`**: Do not use grep to parse profiles.yml. The target's adapter type depends on the nested YAML structure which requires proper YAML parsing. - **stdio MCP needs no lifecycle management**: With stdio transport, claude spawns/kills the MCP server automatically. No `start-eval-mcp.sh` / `stop-eval-mcp.sh` calls needed. The `start-eval-mcp.sh` and `stop-eval-mcp.sh` scripts are retained for SSE mode fallback only. 
-- **Prompt file per scenario**: When running `--all`, create a separate prompt file for each scenario (use `${EVAL_ID}-${SCENARIO_ID}` in the filename) since each scenario has a different prompt. +- **Prompt file per scenario**: `run-batch.sh` handles this automatically (naming: `/tmp/recce-eval-prompt-${EVAL_ID}-${SCENARIO_ID}.txt`). If running manually, create a separate prompt file for each scenario. - **v2 project cleanup**: When `--version v2`, clean up `WORK_DIR` at the end of the Run Flow. Always guard with a `$TMPDIR` prefix check before `rm -rf` to avoid accidental deletion outside temp. @@ -637,9 +576,10 @@ bash run-case.sh --id ch3-phantom-filter --variant with-plugin \ ### Scripts +- **`scripts/run-batch.sh`** — Batch eval runner: renders prompts, runs interleaved loop (baseline→score→with-plugin→score per run per scenario), writes `batch-summary.json`. Background-capable. Encapsulates Steps 5-7 from the original orchestration. - **`scripts/list-scenarios.sh`** — List scenarios for a version. Single `yq eval-all` call. Outputs pipe-delimited rows. -- **`scripts/run-case.sh`** — Atomic runner: setup state, invoke `claude -p`, capture output, teardown, write per-run JSON. Outputs KEY=VALUE lines. -- **`scripts/score-deterministic.sh`** — jq-based scoring against ground truth. Reads and updates per-run JSON in-place. Outputs KEY=VALUE lines. +- **`scripts/run-case.sh`** — Atomic runner: setup state, invoke `claude -p`, capture output, teardown, write per-run JSON. Outputs KEY=VALUE lines. Called by `run-batch.sh`. +- **`scripts/score-deterministic.sh`** — jq-based scoring against ground truth. Reads and updates per-run JSON in-place. Outputs KEY=VALUE lines. Called by `run-batch.sh`. - **`scripts/setup-v2-project.sh`** — Clone a dbt project repo to a temp dir and bootstrap (venv, dbt deps, seed). Used by v2 scenarios only. Outputs `PROJECT_DIR=` and `WORK_DIR=`. 
- **`scripts/start-eval-mcp.sh`** — Start Recce MCP server on eval-specific port (default 8085). Retained for SSE mode fallback only. - **`scripts/stop-eval-mcp.sh`** — Stop eval MCP server. Retained for SSE mode fallback only. diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/SCENARIOS.md b/plugins/recce-dev/skills/recce-eval/scenarios/v2/SCENARIOS.md index c926381..e68a39a 100644 --- a/plugins/recce-dev/skills/recce-eval/scenarios/v2/SCENARIOS.md +++ b/plugins/recce-dev/skills/recce-eval/scenarios/v2/SCENARIOS.md @@ -237,6 +237,104 @@ where subtotal > 0 --- +## data-007: Supply Cost Breakdown — Hidden Fan-out Cascade + +**GitHub Issue**: [#4 — Add Supply Cost Analysis and Perishable Inventory Tracking](https://github.com/DataRecce/jaffle-shop-simulator/issues/4) + +**Story**: Purchasing Manager requests perishable vs non-perishable supply cost breakdown per order item. A teammate modifies the `order_supplies_summary` CTE in `order_items.sql` to add `is_perishable_supply` to the GROUP BY. + +**Init state (buggy PR)**: +```sql +-- order_items.sql — order_supplies_summary CTE +select + product_id, + is_perishable_supply, + sum(supply_cost) as supply_cost +from supplies +group by 1, 2 +``` + +**The bug**: Adding `is_perishable_supply` to GROUP BY changes the grain from 1 row/product to 2 rows/product (perishable + non-perishable). The downstream `LEFT JOIN` fans out every order_item into 2 rows. This cascades: +- `order_items`: row count approximately doubles +- `orders.order_cost`: UNCHANGED (sum of split costs = original total) +- `orders.count_order_items`: DOUBLED +- `orders.count_food_items`: DOUBLED (dashboard column!) +- `orders.count_drink_items`: DOUBLED (dashboard column!) 
+- `orders.order_items_subtotal`: DOUBLED (sum of duplicated product_price) +- `customers`: UNCHANGED (uses order-level columns, not order_items) + +**What we expect the agent to find**: +- Issue found: **yes** — data drift +- Root cause: grain change in order_supplies_summary fans out the join +- Impacted: `order_items`, `orders` +- Not impacted: `stg_orders`, `customers`, `products`, `supplies` +- Dashboard impact: **yes** (count_food_items, count_drink_items doubled) +- Detection requires: **data comparison** + +**Difficulty**: hard — the grain change looks innocent (adding a dimension), but cascades through orders into dashboard columns + +--- + +## data-008: Numeric Precision Refactor — Zero-Change False Positive Trap + +**GitHub Issue**: [#2 — Add Tax Summary Report and Cost Accounting Breakdown](https://github.com/DataRecce/jaffle-shop-simulator/issues/2) + +**Story**: Data Engineer wraps all three `cents_to_dollars()` calls in `stg_orders.sql` with `round(..., 2)` for "defensive precision." + +**Init state (buggy PR)**: +```sql +-- stg_orders.sql +round({{ cents_to_dollars('subtotal') }}, 2) as subtotal, +round({{ cents_to_dollars('tax_paid') }}, 2) as tax_paid, +round({{ cents_to_dollars('order_total') }}, 2) as order_total, +``` + +**The bug**: There is NO bug. The `cents_to_dollars` macro already casts to `numeric(16, 2)`. Applying `round(x, 2)` to a value that is already `numeric(16, 2)` is a complete no-op — zero rows change, zero values change across the entire DAG. 
+ +**What we expect the agent to find**: +- Issue found: **no** — the change is a no-op +- Root cause: round() on already-rounded numeric is redundant +- Impacted: none +- Not impacted: `stg_orders`, `orders`, `customers`, `order_items`, `products` +- Dashboard impact: **no** +- Detection requires: **data comparison** (to confirm zero change, not just code reasoning) + +**Difficulty**: medium — the agent must resist the trap of reporting impact based on DAG reasoning alone (stg_orders is root → everything downstream "could" be affected) + +--- + +## data-009: Date Truncation Change — Month Grain Collapses Daily Timeline + +**GitHub Issue**: [#9 — Optimize Date Granularity for Monthly Reporting](https://github.com/DataRecce/jaffle-shop-simulator/issues/9) + +**Story**: Analytics Engineer changes `date_trunc` in `stg_orders.sql` from `'day'` to `'month'` to "reduce cardinality and improve query performance." + +**Init state (buggy PR)**: +```sql +-- stg_orders.sql +{{ dbt.date_trunc('month','ordered_at') }} as ordered_at +``` + +**The bug**: `ordered_at` loses daily granularity — all orders in the same month collapse to the 1st of the month. This propagates through the entire DAG: +- `orders.ordered_at` — month-level (dashboard column!) +- `orders.customer_order_number` — ROW_NUMBER by month becomes non-deterministic +- `order_items.ordered_at` — month-level +- `customers.first_ordered_at` / `last_ordered_at` — month-level only + +Financial columns (subtotal, tax_paid, order_total) are completely unchanged. Row counts are identical — impact is purely value-level on date columns. 
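The idempotence can be seen with plain decimal arithmetic — a minimal sketch assuming `cents_to_dollars` is equivalent to dividing by 100 and casting to two decimal places (the Python helper below is an illustrative stand-in, not the dbt macro itself):

```python
from decimal import Decimal, ROUND_HALF_UP

def cents_to_dollars(cents: int) -> Decimal:
    # Stand-in for the macro's cast to numeric(16, 2): divide by 100,
    # quantize to 2 decimal places.
    return (Decimal(cents) / 100).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Applying round(x, 2) to a value already quantized to 2 decimal places
# returns the same value for every input — the PR's change is a no-op.
for cents in (0, 5, 1099, 123_456):
    dollars = cents_to_dollars(cents)
    assert round(dollars, 2) == dollars
```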
+ +**What we expect the agent to find**: +- Issue found: **yes** — data drift +- Root cause: date_trunc changed from day to month, collapsing daily granularity +- Impacted: `stg_orders`, `orders`, `order_items`, `customers` +- Not impacted: `products`, `supplies`, `locations` +- Dashboard impact: **yes** (ordered_at is a dashboard column) +- Detection requires: **data comparison** + +**Difficulty**: medium — the agent must correctly scope impact to date columns only and avoid false positives on financial metrics + +--- + ## Summary Matrix | ID | Bug Type | Modified/New | Difficulty | Detection | Dashboard? | Affected Rows | @@ -247,4 +345,7 @@ where subtotal > 0 | data-004 | Count ratio vs cost ratio | New `supply_analysis` | medium | data comparison | no | all rows | | data-005 | current_date on historical data | New `customer_segments` | easy | data comparison | no | all rows | | data-006 | Tax instead of COGS in formula | New `financial_orders` | easy | data comparison | no | all rows | +| data-007 | Grain fan-out cascades to dashboard | Modified `order_items` | hard | data comparison | yes | all rows (doubled) | +| data-008 | No-op precision change (false positive trap) | Modified `stg_orders` | medium | data comparison | no | 0 | +| data-009 | Date grain collapse (day→month) | Modified `stg_orders` | medium | data comparison | yes | 658,657 | | code-001 | Wrong filter column (spec deviation) | Modified `stg_orders` | hard | code review | no | 4,155 | diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-007-supply-grain-fanout.yaml b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-007-supply-grain-fanout.yaml new file mode 100644 index 0000000..5cca3ba --- /dev/null +++ b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-007-supply-grain-fanout.yaml @@ -0,0 +1,72 @@ +id: data-007-supply-grain-fanout +name: "Supply Cost Breakdown — Hidden Fan-out Cascade" +description: "order_items supply summary adds is_perishable_supply to GROUP BY — 
grain change fans out join, doubling count columns through orders mart into dashboard" +github_issue: https://github.com/DataRecce/jaffle-shop-simulator/issues/4 +layer: review +difficulty: hard +stakeholder: purchasing +case_type: problem_exists + +story: | + The Purchasing Manager (P2) requested a breakdown of perishable vs non-perishable supply + costs per order item, to better understand spoilage risk in the supply chain. + + A teammate modified the `order_supplies_summary` CTE in `order_items.sql` to include + `is_perishable_supply` in the GROUP BY and SELECT. This splits each product's supply cost + into two rows: one for perishable supplies, one for non-perishable supplies. + + The code change looks reasonable — adding a dimension to an aggregation. But it changes + the grain of `order_supplies_summary` from 1 row per product to 2 rows per product. + The downstream LEFT JOIN in the `joined` CTE now produces 2 rows per order_item (one for + each perishable category). This fan-out cascades: + + - `order_items`: row count approximately doubles + - `orders.order_cost`: UNCHANGED (sum of split costs = original total) + - `orders.count_order_items`: DOUBLED (counts duplicated rows) + - `orders.count_food_items`: DOUBLED (dashboard column!) + - `orders.count_drink_items`: DOUBLED (dashboard column!) + - `orders.order_items_subtotal`: DOUBLED (sum of duplicated product_price) + - `customers`: UNCHANGED (aggregates use order-level columns from stg_orders, not order_items) + + The bug is a classic grain mismatch hidden behind an innocent-looking GROUP BY change. 
+ +environment: + repo: DataRecce/jaffle-shop-simulator + ref: eval-base + adapter: duckdb + +setup: + strategy: git_patch + patch_reverse_file: scenarios/v2/patches/data-007-supply-grain-fanout.patch + skip_context: false + +prompt: + template: prompts/review.md + vars: + stakeholder_name: "Purchasing Manager (P2)" + stakeholder_request: "Add perishable vs non-perishable supply cost breakdown per order item for spoilage risk analysis" + pr_description: "Add is_perishable_supply dimension to order_items supply cost aggregation — splits supply_cost into perishable and non-perishable components" + +headless: + max_budget_usd: 5.00 + output_format: json + +ground_truth: + issue_found: true + issue_type: data_drift + root_cause_keywords: ["grain", "fan-out", "group by", "is_perishable_supply", "duplicate", "count", "order_supplies_summary", "double"] + impacted_models: ["order_items", "orders"] + not_impacted_models: ["stg_orders", "customers", "products", "supplies"] + dashboard_impact: true + detection_requires: data_comparison + +judge_criteria: + - "Agent identifies the grain change in order_supplies_summary (1 row/product → 2 rows/product)" + - "Agent recognizes the fan-out cascade: order_items rows doubled → orders count columns doubled" + - "Agent notes that order_cost (sum of supply_cost) is UNCHANGED despite the fan-out — sum of parts equals the original total" + - "Agent identifies that count_food_items and count_drink_items are DOUBLED — these are Executive Dashboard columns" + - "Agent correctly identifies that customers model is NOT impacted" + - "Agent correctly identifies dashboard_impact as true (count_food_items, count_drink_items)" + +teardown: + restore_files: ["models/marts/order_items.sql"] diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-008-precision-noop.yaml b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-008-precision-noop.yaml new file mode 100644 index 0000000..b2195d8 --- /dev/null +++ 
b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-008-precision-noop.yaml @@ -0,0 +1,69 @@ +id: data-008-precision-noop +name: "Numeric Precision Refactor — Zero-Change False Positive Trap" +description: "stg_orders wraps cents_to_dollars with round(x, 2) — macro already outputs numeric(16,2) so data is identical, but code diff touches root staging model" +github_issue: https://github.com/DataRecce/jaffle-shop-simulator/issues/2 +layer: review +difficulty: medium +stakeholder: data-engineering +case_type: no_problem + +story: | + A Data Engineer noticed that the `cents_to_dollars` macro returns `::numeric(16, 2)` but + wanted to make the precision "explicit and defensive" by wrapping all three money columns + in `stg_orders.sql` with `round(..., 2)`. + + The PR description says: "Add explicit rounding to money columns for precision safety — + ensures no floating point drift in downstream aggregations." + + The change modifies `stg_orders.sql`, which is the ROOT staging model feeding into + `orders`, `customers`, and every downstream mart. A code-only reviewer seeing a change + to the root financial staging model would reasonably flag this as high-risk and report + potential impact on all downstream models. + + However, `cents_to_dollars` already casts to `numeric(16, 2)`. Applying `round(x, 2)` to + a value that is already `numeric(16, 2)` is a complete no-op — zero rows change, zero + values change, zero downstream impact. The correct assessment is: no issue found. + + This scenario tests whether the agent can use data comparison to CONFIRM safety rather + than relying on DAG reasoning alone (which would produce false positives). 
+ +environment: + repo: DataRecce/jaffle-shop-simulator + ref: eval-base + adapter: duckdb + +setup: + strategy: git_patch + patch_reverse_file: scenarios/v2/patches/data-008-precision-noop.patch + skip_context: false + +prompt: + template: prompts/review.md + vars: + stakeholder_name: "Data Engineer (P3)" + stakeholder_request: "Add explicit rounding to money columns in stg_orders for precision safety" + pr_description: "Wrap cents_to_dollars output with round(x, 2) in stg_orders — defensive precision for downstream financial aggregations" + +headless: + max_budget_usd: 5.00 + output_format: json + +ground_truth: + issue_found: false + issue_type: no_issue + root_cause_keywords: ["no-op", "round", "numeric", "precision", "already", "identical", "no change", "zero"] + impacted_models: [] + not_impacted_models: ["stg_orders", "orders", "customers", "order_items", "products"] + dashboard_impact: false + detection_requires: data_comparison + +judge_criteria: + - "Agent verifies through DATA comparison that all downstream models have zero value changes" + - "Agent recognizes that round(numeric(16,2), 2) is a no-op — the macro already handles precision" + - "Agent does NOT report false positives on orders, customers, or other downstream models" + - "Agent correctly concludes issue_found: false — no data impact despite code change to root model" + - "Agent correctly identifies dashboard_impact as false" + - "Agent avoids the trap of DAG-based reasoning alone (stg_orders is root → everything must be impacted)" + +teardown: + restore_files: ["models/staging/stg_orders.sql"] diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-009-date-grain-month.yaml b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-009-date-grain-month.yaml new file mode 100644 index 0000000..bfd1796 --- /dev/null +++ b/plugins/recce-dev/skills/recce-eval/scenarios/v2/data-009-date-grain-month.yaml @@ -0,0 +1,78 @@ +id: data-009-date-grain-month +name: "Date Truncation Change — Month 
Grain Collapses Daily Timeline" +description: "stg_orders changes date_trunc from day to month — ordered_at loses daily granularity across entire DAG, but financial columns are unchanged" +github_issue: https://github.com/DataRecce/jaffle-shop-simulator/issues/9 +layer: review +difficulty: medium +stakeholder: analytics +case_type: problem_exists + +story: | + An Analytics Engineer proposed changing the date truncation in `stg_orders.sql` from + `day` to `month` to "reduce cardinality and improve query performance for monthly + reporting dashboards." + + The PR modifies one line in `stg_orders.sql`: + - Before: `date_trunc('day', ordered_at)` + - After: `date_trunc('month', ordered_at)` + + The change compiles fine and all dbt tests pass. The PR description argues this is a + harmless optimization since "most reports aggregate to monthly anyway." + + However, `stg_orders` is the ROOT staging model for the entire orders pipeline. The + `ordered_at` column propagates through: + - `orders.ordered_at` — now month-level (dashboard column!) + - `orders.customer_order_number` — ROW_NUMBER ordered by month becomes non-deterministic + for orders within the same month + - `order_items.ordered_at` — joined from stg_orders, now month-level + - `customers.first_ordered_at` — now month-level only (loses day precision) + - `customers.last_ordered_at` — now month-level only (loses day precision) + + Critically, financial columns (subtotal, tax_paid, order_total, order_cost) are + COMPLETELY UNCHANGED. The agent must correctly scope the impact to date/time columns + only and avoid false positives on financial metrics. + + Row counts are identical across all models — no rows added or removed. The impact is + purely in value changes to the ordered_at column and its derivatives. 
+ +environment: + repo: DataRecce/jaffle-shop-simulator + ref: eval-base + adapter: duckdb + +setup: + strategy: git_patch + patch_reverse_file: scenarios/v2/patches/data-009-date-grain-month.patch + skip_context: false + +prompt: + template: prompts/review.md + vars: + stakeholder_name: "Analytics Engineer (P3)" + stakeholder_request: "Optimize date granularity in stg_orders from daily to monthly for reporting performance" + pr_description: "Change date_trunc from day to month in stg_orders — reduces ordered_at cardinality for faster monthly aggregations" + +headless: + max_budget_usd: 5.00 + output_format: json + +ground_truth: + issue_found: true + issue_type: data_drift + root_cause_keywords: ["date_trunc", "month", "day", "ordered_at", "granularity", "precision", "cardinality"] + impacted_models: ["stg_orders", "orders", "order_items", "customers"] + not_impacted_models: ["products", "supplies", "locations"] + dashboard_impact: true + detection_requires: data_comparison + +judge_criteria: + - "Agent identifies that ordered_at loses daily granularity — collapses to month-level across the DAG" + - "Agent correctly identifies dashboard_impact as true (ordered_at is a dashboard column)" + - "Agent correctly identifies that financial columns (subtotal, tax_paid, order_total) are UNCHANGED" + - "Agent correctly scopes impacted_models to those that use ordered_at: stg_orders, orders, order_items, customers" + - "Agent does NOT falsely report products, supplies, or locations as impacted" + - "Agent notes that customer_order_number becomes non-deterministic for same-month orders" + - "Agent recognizes row counts are unchanged — the impact is value-level, not row-level" + +teardown: + restore_files: ["models/staging/stg_orders.sql"] diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-007-supply-grain-fanout.patch b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-007-supply-grain-fanout.patch new file mode 100644 index 
0000000..900c962 --- /dev/null +++ b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-007-supply-grain-fanout.patch @@ -0,0 +1,26 @@ +diff --git a/models/marts/order_items.sql b/models/marts/order_items.sql +--- a/models/marts/order_items.sql ++++ b/models/marts/order_items.sql +@@ -29,13 +29,12 @@ + + select + product_id, +- is_perishable_supply, + + sum(supply_cost) as supply_cost + + from supplies + +- group by 1, 2 ++ group by 1 + + ), + +@@ -51,7 +50,6 @@ + products.is_food_item, + products.is_drink_item, + +- order_supplies_summary.is_perishable_supply, + order_supplies_summary.supply_cost + + from order_items diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-008-precision-noop.patch b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-008-precision-noop.patch new file mode 100644 index 0000000..acb6c63 --- /dev/null +++ b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-008-precision-noop.patch @@ -0,0 +1,16 @@ +diff --git a/models/staging/stg_orders.sql b/models/staging/stg_orders.sql +--- a/models/staging/stg_orders.sql ++++ b/models/staging/stg_orders.sql +@@ -19,9 +19,9 @@ + subtotal as subtotal_cents, + tax_paid as tax_paid_cents, + order_total as order_total_cents, +- round({{ cents_to_dollars('subtotal') }}, 2) as subtotal, +- round({{ cents_to_dollars('tax_paid') }}, 2) as tax_paid, +- round({{ cents_to_dollars('order_total') }}, 2) as order_total, ++ {{ cents_to_dollars('subtotal') }} as subtotal, ++ {{ cents_to_dollars('tax_paid') }} as tax_paid, ++ {{ cents_to_dollars('order_total') }} as order_total, + + ---------- timestamps + {{ dbt.date_trunc('day','ordered_at') }} as ordered_at diff --git a/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-009-date-grain-month.patch b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-009-date-grain-month.patch new file mode 100644 index 0000000..188f689 --- /dev/null +++ 
b/plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/data-009-date-grain-month.patch @@ -0,0 +1,12 @@ +diff --git a/models/staging/stg_orders.sql b/models/staging/stg_orders.sql +--- a/models/staging/stg_orders.sql ++++ b/models/staging/stg_orders.sql +@@ -24,7 +24,7 @@ + {{ cents_to_dollars('order_total') }} as order_total, + + ---------- timestamps +- {{ dbt.date_trunc('month','ordered_at') }} as ordered_at ++ {{ dbt.date_trunc('day','ordered_at') }} as ordered_at + + from source + diff --git a/plugins/recce-dev/skills/recce-eval/scripts/run-batch.sh b/plugins/recce-dev/skills/recce-eval/scripts/run-batch.sh new file mode 100755 index 0000000..ae76b06 --- /dev/null +++ b/plugins/recce-dev/skills/recce-eval/scripts/run-batch.sh @@ -0,0 +1,262 @@ +#!/bin/bash +# run-batch.sh — Batch eval runner: render prompts → interleaved run loop → score +# +# Implements SKILL.md Step 6 (prompt rendering, interleaved run loop, and +# deterministic scoring) as a single background-capable script. +# Caller (SKILL.md) handles Steps 1-5 (parse scenarios, bootstrap project, +# detect adapter, create batch dir, generate MCP config) and Steps 7-11 +# (judge, meta, report).
+# +# Usage: +# bash run-batch.sh \ +# --scenarios scenario1.yaml,scenario2.yaml \ +# --batch-dir /path/to/batch \ +# --eval-id 20260404-1530 \ +# --skill-dir /path/to/recce-eval \ +# --recce-plugin /path/to/recce-plugin \ +# --target dev \ +# --adapter-desc "DuckDB (local file database, target: dev)" \ +# [--mcp-config /tmp/mcp.json] \ +# [-n 3] [--model claude-sonnet-4-20250514] [--mode real-world] \ +# [--no-bare] [--project-dir /path/to/project] +# +# Output: +# - Per-run JSONs: $BATCH_DIR/<scenario-id>/<variant>_run<N>.json (via run-case.sh) +# - Deterministic scores merged into per-run JSONs (via score-deterministic.sh) +# - Batch summary: $BATCH_DIR/batch-summary.json +# - Progress lines to stdout +set -euo pipefail + +# ========== Argument Parsing ========== +SCENARIOS="" BATCH_DIR="" EVAL_ID="" SKILL_DIR="" RECCE_PLUGIN="" +TARGET="" ADAPTER_DESC="" MCP_CONFIG="" RUNS=1 MODEL="" MODE="real-world" +NO_BARE="" PROJECT_DIR="" + +while [[ $# -gt 0 ]]; do + case "$1" in + --scenarios) SCENARIOS="$2"; shift 2 ;; + --batch-dir) BATCH_DIR="$2"; shift 2 ;; + --eval-id) EVAL_ID="$2"; shift 2 ;; + --skill-dir) SKILL_DIR="$2"; shift 2 ;; + --recce-plugin) RECCE_PLUGIN="$2"; shift 2 ;; + --target) TARGET="$2"; shift 2 ;; + --adapter-desc) ADAPTER_DESC="$2"; shift 2 ;; + --mcp-config) MCP_CONFIG="$2"; shift 2 ;; + -n|--runs) RUNS="$2"; shift 2 ;; + --model) MODEL="$2"; shift 2 ;; + --mode) MODE="$2"; shift 2 ;; + --no-bare) NO_BARE="true"; shift 1 ;; + --project-dir) PROJECT_DIR="$2"; shift 2 ;; + *) echo "ERROR: Unknown argument: $1" >&2; exit 1 ;; + esac +done + +# ========== Validation ========== +MISSING="" +[ -z "$SCENARIOS" ] && MISSING="$MISSING --scenarios" +[ -z "$BATCH_DIR" ] && MISSING="$MISSING --batch-dir" +[ -z "$EVAL_ID" ] && MISSING="$MISSING --eval-id" +[ -z "$SKILL_DIR" ] && MISSING="$MISSING --skill-dir" +[ -z "$RECCE_PLUGIN" ] && MISSING="$MISSING --recce-plugin" +[ -z "$TARGET" ] && MISSING="$MISSING --target" +[ -z "$ADAPTER_DESC" ] && MISSING="$MISSING --adapter-desc" +
+if [ -n "$MISSING" ]; then + echo "ERROR: Missing required arguments:$MISSING" >&2 + exit 1 +fi + +for cmd in yq jq python3; do + if ! command -v "$cmd" &>/dev/null; then + echo "ERROR: Required command not found: $cmd" >&2 + exit 1 + fi +done + +IFS=',' read -ra SCENARIO_FILES_RAW <<< "$SCENARIOS" +SCENARIO_FILES=() +for f in "${SCENARIO_FILES_RAW[@]}"; do + f="${f#"${f%%[![:space:]]*}"}" # trim leading whitespace + f="${f%"${f##*[![:space:]]}"}" # trim trailing whitespace + if [ ! -f "$f" ]; then + echo "ERROR: Scenario file not found: $f" >&2 + exit 1 + fi + SCENARIO_FILES+=("$f") +done + +mkdir -p "$BATCH_DIR" + +# ========== Phase 1: Render Prompts ========== +echo "=== Phase 1: Rendering prompts for ${#SCENARIO_FILES[@]} scenarios ===" + +for SCENARIO_FILE in "${SCENARIO_FILES[@]}"; do + SCENARIO_ID=$(yq -r '.id' "$SCENARIO_FILE") + TEMPLATE=$(yq -r '.prompt.template // ""' "$SCENARIO_FILE") + PROMPT_FILE="/tmp/recce-eval-prompt-${EVAL_ID}-${SCENARIO_ID}.txt" + + if [ -n "$TEMPLATE" ]; then + # v2: template + vars substituted by render-prompt.py + python3 "${SKILL_DIR}/scripts/render-prompt.py" \ + "${SKILL_DIR}/${TEMPLATE}" "$SCENARIO_FILE" \ + --var "adapter_description=${ADAPTER_DESC}" \ + --var "target=${TARGET}" \ + > "$PROMPT_FILE" + else + # v1: inline prompt with runtime variable substitution + PROMPT_TEXT=$(yq -r '.prompt' "$SCENARIO_FILE") + PROMPT_TEXT="${PROMPT_TEXT//\{adapter_description\}/$ADAPTER_DESC}" + PROMPT_TEXT="${PROMPT_TEXT//\{target\}/$TARGET}" + printf '%s' "$PROMPT_TEXT" > "$PROMPT_FILE" + fi + + echo " [ok] $SCENARIO_ID" +done + +# ========== Phase 2: Interleaved Run Loop ========== +# Order: for each run_num → for each scenario → baseline then with-plugin. +# Interleaving reduces systematic bias from cache warming or temporal effects. 
+TOTAL_RUNS=$(( ${#SCENARIO_FILES[@]} * RUNS * 2 )) +echo "" +echo "=== Phase 2: Running $TOTAL_RUNS cases (${#SCENARIO_FILES[@]} scenarios x $RUNS runs x 2 variants) ===" +echo "" + +RUN_INDEX=0 +SUCCEEDED=0 +FAILED=0 +BATCH_START=$(date +%s) + +for (( run_num=1; run_num<=RUNS; run_num++ )); do + for SCENARIO_FILE in "${SCENARIO_FILES[@]}"; do + # Parse scenario metadata once per scenario per run_num + SCENARIO_ID=$(yq -r '.id' "$SCENARIO_FILE") + CASE_TYPE=$(yq -r '.case_type' "$SCENARIO_FILE") + SETUP_STRATEGY=$(yq -r '.setup.strategy' "$SCENARIO_FILE") + PATCH_REL=$(yq -r '.setup.patch_reverse_file // ""' "$SCENARIO_FILE") + SKIP_CTX=$(yq -r '.setup.skip_context // "false"' "$SCENARIO_FILE") + RESTORE=$(yq -r '.teardown.restore_files // [] | join(",")' "$SCENARIO_FILE") + MAX_BUDGET=$(yq -r '.headless.max_budget_usd' "$SCENARIO_FILE") + GT_JSON=$(yq -o=json '.ground_truth' "$SCENARIO_FILE" | jq -c .) + PROMPT_FILE="/tmp/recce-eval-prompt-${EVAL_ID}-${SCENARIO_ID}.txt" + + mkdir -p "$BATCH_DIR/$SCENARIO_ID" + + for VARIANT in baseline with-plugin; do + RUN_INDEX=$(( RUN_INDEX + 1 )) + CASE_START=$(date +%s) + echo "[${RUN_INDEX}/${TOTAL_RUNS}] ${SCENARIO_ID} ${VARIANT} run${run_num} — starting" + + # Build run-case.sh argument list + RUN_ARGS=( + --id "$SCENARIO_ID" + --case-type "$CASE_TYPE" + --variant "$VARIANT" + --prompt-file "$PROMPT_FILE" + --setup-strategy "$SETUP_STRATEGY" + --target "$TARGET" + --max-budget-usd "$MAX_BUDGET" + --output-dir "$BATCH_DIR/$SCENARIO_ID" + --run-number "$run_num" + ) + + # Isolation mode: --bare (default) or --no-bare + if [ -z "$NO_BARE" ]; then + RUN_ARGS+=(--bare) + else + RUN_ARGS+=(--no-bare --no-clean-profile) + fi + + # Patch file (only for git_patch strategy) + if [ "$SETUP_STRATEGY" = "git_patch" ] && [ -n "$PATCH_REL" ] && [ "$PATCH_REL" != "null" ]; then + RUN_ARGS+=(--patch-file "${SKILL_DIR}/${PATCH_REL}") + fi + [ -n "$RESTORE" ] && RUN_ARGS+=(--restore-files "$RESTORE") + + # With-plugin variant: inject plugin 
+ MCP + if [ "$VARIANT" = "with-plugin" ]; then + RUN_ARGS+=(--plugin-dir "$RECCE_PLUGIN") + [ -n "$MCP_CONFIG" ] && RUN_ARGS+=(--mcp-config "$MCP_CONFIG") + fi + + # Optional flags + [ -n "$MODEL" ] && RUN_ARGS+=(--model "$MODEL") + RUN_ARGS+=(--mode "$MODE") + [ -n "$PROJECT_DIR" ] && RUN_ARGS+=(--project-dir "$PROJECT_DIR") + [ "$SKIP_CTX" = "true" ] && RUN_ARGS+=(--skip-setup-context) + + # Execute run-case.sh + RUN_FILE="$BATCH_DIR/$SCENARIO_ID/${VARIANT}_run${run_num}.json" + RUN_OUTPUT="" + if RUN_OUTPUT=$(bash "${SKILL_DIR}/scripts/run-case.sh" "${RUN_ARGS[@]}" 2>&1); then + # Parse KEY=VALUE output from run-case.sh + COST=$(echo "$RUN_OUTPUT" | grep "^TOTAL_COST_USD=" | cut -d= -f2 || echo "?") + DURATION=$(echo "$RUN_OUTPUT" | grep "^DURATION_MS=" | cut -d= -f2 || echo "0") + JSON_OK=$(echo "$RUN_OUTPUT" | grep "^JSON_EXTRACTED=" | cut -d= -f2 || echo "?") + + # Score immediately after each run + SCORE_OUTPUT="" + if SCORE_OUTPUT=$(bash "${SKILL_DIR}/scripts/score-deterministic.sh" \ + --run-file "$RUN_FILE" \ + --case-type "$CASE_TYPE" \ + --ground-truth "$GT_JSON" 2>&1); then + PASS_RATE=$(echo "$SCORE_OUTPUT" | grep "^PASS_RATE=" | cut -d= -f2 || echo "?") + else + PASS_RATE="score-error" + fi + + DURATION_SEC=$(( ${DURATION:-0} / 1000 )) + echo "[${RUN_INDEX}/${TOTAL_RUNS}] ${SCENARIO_ID} ${VARIANT} run${run_num} — DONE cost=\$${COST} duration=${DURATION_SEC}s json=${JSON_OK} pass_rate=${PASS_RATE}" + SUCCEEDED=$(( SUCCEEDED + 1 )) + else + CASE_END=$(date +%s) + WALL_SEC=$(( CASE_END - CASE_START )) + echo "[${RUN_INDEX}/${TOTAL_RUNS}] ${SCENARIO_ID} ${VARIANT} run${run_num} — FAILED after ${WALL_SEC}s" + echo "$RUN_OUTPUT" | tail -3 | sed 's/^/ > /' + FAILED=$(( FAILED + 1 )) + fi + done + done +done + +# ========== Phase 3: Summary ========== +BATCH_END=$(date +%s) +BATCH_DURATION=$(( BATCH_END - BATCH_START )) +BATCH_MINUTES=$(( BATCH_DURATION / 60 )) +BATCH_SECONDS=$(( BATCH_DURATION % 60 )) + +echo "" +echo "=== BATCH COMPLETE ===" +echo "Eval 
ID: $EVAL_ID" +echo "Succeeded: $SUCCEEDED / $TOTAL_RUNS" +echo "Failed: $FAILED / $TOTAL_RUNS" +echo "Duration: ${BATCH_MINUTES}m ${BATCH_SECONDS}s" +echo "Output: $BATCH_DIR" + +# Write machine-readable summary for SKILL.md Steps 7-11 +SCENARIO_IDS="[]" +for SCENARIO_FILE in "${SCENARIO_FILES[@]}"; do + SID=$(yq -r '.id' "$SCENARIO_FILE") + SCENARIO_IDS=$(echo "$SCENARIO_IDS" | jq --arg id "$SID" '. + [$id]') +done + +jq -n \ + --arg eval_id "$EVAL_ID" \ + --argjson total "$TOTAL_RUNS" \ + --argjson succeeded "$SUCCEEDED" \ + --argjson failed "$FAILED" \ + --argjson runs "$RUNS" \ + --argjson scenarios "$SCENARIO_IDS" \ + --arg batch_dir "$BATCH_DIR" \ + --arg completed_at "$(date -u +"%Y-%m-%dT%H:%M:%SZ")" \ + --argjson duration_sec "$BATCH_DURATION" \ + '{ + eval_id: $eval_id, + total_runs: $total, + succeeded: $succeeded, + failed: $failed, + runs_per_scenario: $runs, + scenarios: $scenarios, + batch_dir: $batch_dir, + completed_at: $completed_at, + duration_sec: $duration_sec + }' > "$BATCH_DIR/batch-summary.json" + +echo "Summary: $BATCH_DIR/batch-summary.json" diff --git a/plugins/recce-dev/skills/recce-eval/scripts/run-case.sh b/plugins/recce-dev/skills/recce-eval/scripts/run-case.sh index 6b0047d..9aeab4d 100755 --- a/plugins/recce-dev/skills/recce-eval/scripts/run-case.sh +++ b/plugins/recce-dev/skills/recce-eval/scripts/run-case.sh @@ -153,13 +153,12 @@ cleanup() { f="${f#"${f%%[![:space:]]*}"}" # trim leading whitespace f="${f%"${f##*[![:space:]]}"}" # trim trailing whitespace if [ -n "$f" ] && git rev-parse --is-inside-work-tree &>/dev/null 2>&1; then - # For tracked files, unstage then restore from git. - # For untracked files (created by reverse-applying "deleted file" - # patches), remove them.
- if git ls-files --error-unmatch "$f" &>/dev/null 2>&1; then - git restore --staged -- "$f" 2>/dev/null || true + git restore --staged -- "$f" 2>/dev/null || true + if git show HEAD:"$f" &>/dev/null 2>&1; then + # File exists in HEAD — restore to HEAD state git checkout -- "$f" 2>/dev/null || true else + # File doesn't exist in HEAD (created by patch) — remove rm -f "$f" 2>/dev/null || true fi fi @@ -180,21 +179,47 @@ if [ "$DRY_RUN" = "false" ] && [ "$SKIP_SETUP" = "false" ]; then echo "ERROR: Patch file not found: $PATCH_FILE" >&2 exit 1 fi - # Build base state in a SEPARATE schema so Recce can compare data. - # DuckDB uses one file with multiple schemas. Without a separate base - # schema, value_diff compares dev against itself → 0 differences. - # 1. Build clean state in both dev (current) and prod (base) schemas - # 2. Capture base artifacts from prod - # 3. Apply patch and rebuild dev only - # 4. Recce compares dev (buggy) vs prod (clean) for actual data diffs + # Build base state only for with-plugin variant. + # Baseline gets NO comparison target — it must reason from code + + # single-schema data alone. This mirrors reality: without Recce, + # a developer has no pre-built before/after comparison. + # With-plugin gets both prod (clean) and dev (buggy) schemas so + # Recce MCP tools (value_diff, profile_diff) can compare data. BASE_TARGET="prod" - dbt run --target "$BASE_TARGET" --full-refresh --quiet - dbt docs generate --target-path target-base --target "$BASE_TARGET" --quiet 2>/dev/null || true + if [ "$VARIANT" = "baseline" ]; then + # Drop stale prod schema from prior with-plugin runs. + # In batch mode, interleaved execution (baseline → with-plugin + # per scenario) leaves prod schema in the shared DuckDB file. + # Without this cleanup, baseline gets a free comparison target. 
+ python3 -c " +import os, duckdb +db_path = os.environ.get('JAFFLE_SHOP_DB_PATH', 'data/jaffel-shop.duckdb') +db = duckdb.connect(db_path) +db.execute('DROP SCHEMA IF EXISTS prod CASCADE') +db.close() +" 2>/dev/null || true + fi + if [ "$VARIANT" = "with-plugin" ]; then + dbt run --target "$BASE_TARGET" --full-refresh --quiet + dbt docs generate --target-path target-base --target "$BASE_TARGET" --quiet 2>/dev/null || true + fi # Now apply patch (introduces the bug) and rebuild current state. # Use --full-refresh so incremental models reprocess ALL rows with # the buggy code — otherwise value_diff sees 0 changed rows because # the stored data was computed before the patch was applied. - git apply --reverse --3way "$PATCH_FILE" + # Try --3way first (handles whitespace mismatches), fall back to + # plain apply only for patches that create new files (no base in index). + GIT_APPLY_STDERR=$(mktemp) + if git apply --reverse --3way "$PATCH_FILE" 2>"$GIT_APPLY_STDERR"; then + rm -f "$GIT_APPLY_STDERR" + elif grep -q "does not exist in index" "$GIT_APPLY_STDERR"; then + rm -f "$GIT_APPLY_STDERR" + git apply --reverse "$PATCH_FILE" + else + cat "$GIT_APPLY_STDERR" >&2 + rm -f "$GIT_APPLY_STDERR" + exit 1 + fi dbt run --target "$TARGET" --full-refresh --quiet dbt docs generate --target "$TARGET" --quiet 2>/dev/null || true # Run dbt test BEFORE MCP starts (avoids DuckDB lock conflict). 
diff --git a/plugins/recce/agents/recce-reviewer.md b/plugins/recce/agents/recce-reviewer.md index 8645974..3e654c5 100644 --- a/plugins/recce/agents/recce-reviewer.md +++ b/plugins/recce/agents/recce-reviewer.md @@ -73,13 +73,15 @@ This single call returns: **Interpret `data_impact` for each model:** - `confirmed`: value_diff verified actual data changes — prioritize for root cause investigation - `none`: value_diff verified NO data changes — safe, note briefly in summary -- `null` (or absent): couldn't run value_diff (views, no PK) — unknown, use profile_diff to assess +- `potential`: value_diff was skipped (views, downstream models, no PK) — **MUST follow up** using the model's `next_action` before classifying as impacted or not_impacted. Do NOT put `potential` models in `not_impacted` without investigation. If `impacted_models` is empty: output the "No impact detected" summary (see Section 4) and STOP. ### Step 2 — Follow-up Investigation -For each entry in `suggested_deep_dives`: +**Priority order**: Models with `data_impact: potential` and a `next_action` field take priority over `suggested_deep_dives`. Follow every `next_action` — these are models where impact is unknown and classification depends on your investigation. + +Then, for remaining entries in `suggested_deep_dives`: **2a. Value diff** — If `value_diff` in impact_analysis shows `rows_changed > 0` or the suggestion mentions value changes, call: ``` @@ -95,7 +97,7 @@ This gives distributions (min, max, mean, nulls, distinct counts) that reveal th - If `columns` is null in the suggestion: call `profile_diff` on the whole model (omit `columns` parameter). - On any MCP error: record "tool skipped for {model}: {error reason}" and continue. -- Limit to the first 3 suggested deep dives to control cost. +- Always follow ALL `next_action` items from `potential` models. For additional `suggested_deep_dives` beyond that, limit to 3 to control cost. ### Step 3 — Root Cause Diagnosis
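The Step 2 prioritization above (every `next_action` from a `potential` model first, then at most 3 remaining `suggested_deep_dives`) can be sketched with `jq`, which the eval scripts already require. The payload shape here (`name`, `model`, `reason` fields) is an assumption for illustration, not the documented impact_analysis schema:

```shell
# Hedged sketch: build the follow-up queue from a hypothetical payload.
set -euo pipefail
payload='{
  "impacted_models": [
    {"name": "orders", "data_impact": "confirmed"},
    {"name": "customers", "data_impact": "potential",
     "next_action": "profile_diff on first_ordered_at"}
  ],
  "suggested_deep_dives": [
    {"model": "orders", "reason": "rows_changed > 0"},
    {"model": "order_items", "reason": "joined column drift"}
  ]
}'
queue=$(jq -r '
  # Mandatory follow-ups: potential models with a next_action
  ([.impacted_models[]
    | select(.data_impact == "potential" and .next_action != null)
    | "next_action \(.name) \(.next_action)"]) as $must
  | ([.impacted_models[]
      | select(.data_impact == "potential" and .next_action != null)
      | .name]) as $covered
  # Then at most 3 deep dives not already covered above
  | ($must
     + ([.suggested_deep_dives[]
         | select(.model as $m | $covered | index($m) | not)
         | "deep_dive \(.model) \(.reason)"][:3]))
  | .[]' <<< "$payload")
echo "$queue"
```

Note the design choice this mirrors: only the deep-dive tail is capped for cost; `next_action` items are never dropped, since their models cannot be classified without investigation.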