From 3c7b17718cf5845b757f7ac10e4ee5f029a71d91 Mon Sep 17 00:00:00 2001
From: Kulvir <kulvirgithub@gmail.com>
Date: Mon, 30 Mar 2026 16:20:51 -0700
Subject: [PATCH 1/2] feat: add built-in performance tracking with config flag
 and dashboard

Bake Step 11 (performance metrics) directly into code-review and
plan-review commands, gated by `performance_tracking` config flag.

Changes:
- Add `performance_tracking: false` to default config schema
- Add Step 11 to code-review and plan-review (records per-model
  response rates, unique/consensus findings, convergence votes)
- Add Step 8.5 to consensus-setup: opt-in prompt for performance
  tracking during wizard, creates JSON file if enabled
- Add `/consensus:performance` command: model leaderboard, prune
  candidates, MVPs, reliability issues, single-model deep dive
- Data stored at `~/.claude/multi-model-performance.json`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 plugins/consensus/commands/code-review.md     |  56 +++++++
 plugins/consensus/commands/consensus-setup.md |  34 ++++
 plugins/consensus/commands/performance.md     | 152 ++++++++++++++++++
 plugins/consensus/commands/plan-review.md     |  56 +++++++
 plugins/consensus/consensus.config.json       |   3 +-
 5 files changed, 300 insertions(+), 1 deletion(-)
 create mode 100644 plugins/consensus/commands/performance.md

diff --git a/plugins/consensus/commands/code-review.md b/plugins/consensus/commands/code-review.md
index ada04e8..3df79e7 100644
--- a/plugins/consensus/commands/code-review.md
+++ b/plugins/consensus/commands/code-review.md
@@ -469,6 +469,62 @@ TeamDelete
 
 On failure: preserve `$SESSION_DIR` for debugging and tell the user where files are.
 
+## Step 11: Record Performance Metrics (if enabled)
+
+Check the `performance_tracking` field from the config loaded in Step 0. If it is `true`, execute this step. If `false` or missing, skip entirely.
+
+### 11a: Gather Data
+
+For each model that participated (Claude + all models in `MODELS`, + CodeRabbit if it ran):
+
+1. **Response status**: Check `$SESSION_DIR/{model-id}.md`
+   - Exists and > 0 bytes → `"yes"`
+   - Exists but 0 bytes → `"failed"`
+   - Does not exist → `"timeout"`
+
+2. **Output size**: `wc -c < $SESSION_DIR/{model-id}.md 2>/dev/null || echo 0`
+
+3. **Convergence vote**: Read `$SESSION_DIR/{model-id}-convergence.md`. Find first occurrence of `APPROVE` or `CHANGES NEEDED`. If file missing or neither found → `null`. (CodeRabbit has no convergence vote — always `null`.)
+
+4. **Attribution counts from `$SESSION_DIR/draft.md`**: Find all `_Flagged by:` lines. For each line:
+   - Parse model names between `_Flagged by:` and `_`
+   - Normalize names to model IDs using config model names and IDs. Also: Claude→claude, CodeRabbit→coderabbit.
+   - If 1-2 models listed → each gets +1 `unique_findings`
+   - If 3+ models listed → each gets +1 `consensus_findings`
+   - Count total `_Flagged by:` lines as `total_findings`
+
+### 11b: Build Run Record
+
+```json
+{
+  "id": "{SESSION_ID from SESSION_DIR path suffix}",
+  "timestamp": "{current UTC ISO 8601}",
+  "command": "consensus:code-review",
+  "target": "{short description of review target}",
+  "review_type": "CODE_REVIEW",
+  "total_findings": "{count}",
+  "models": {
+    "{model-id}": {
+      "responded": "yes|failed|timeout",
+      "unique_findings": "{N}",
+      "consensus_findings": "{N}",
+      "convergence_vote": "APPROVE|CHANGES NEEDED|null",
+      "output_bytes": "{N}"
+    }
+  }
+}
+```
+
+### 11c: Append to JSON
+
+1. Read `~/.claude/multi-model-performance.json`. If it does not exist, create it with `{"version": 1, "runs": []}`.
+2. Parse JSON
+3. Check if `id` already exists in `runs[]` — if so, skip (dedup)
+4. Append run record to `runs[]`
+5. Write updated JSON back
+
+Report: `"Performance tracked: {total_findings} findings across {M} models. {len(runs)} total runs recorded."`
+
 ## Rules
 
 1. **No model names in prompts.** Never mention any AI model name in prompts sent to external tools. Frame as first-person.
diff --git a/plugins/consensus/commands/consensus-setup.md b/plugins/consensus/commands/consensus-setup.md
index 6f219d9..f520000 100644
--- a/plugins/consensus/commands/consensus-setup.md
+++ b/plugins/consensus/commands/consensus-setup.md
@@ -268,6 +268,36 @@ Do NOT finalize the config in this case. Delete `~/.claude/consensus.json` and a
 
 If quorum is still met after disabling failed models, rewrite `~/.claude/consensus.json` with updated enabled/disabled states.
 
+## Step 8.5: Performance Tracking
+
+Ask the user if they want to enable automatic performance tracking:
+
+```
+AskUserQuestion:
+  question: "Enable performance tracking? Records which models contribute unique findings so you can identify and prune underperformers."
+  header: "Performance Tracking"
+  options:
+    - label: "Yes (Recommended)"
+      description: "Append metrics to ~/.claude/multi-model-performance.json after each run. View with /consensus:performance."
+    - label: "No"
+      description: "Skip performance tracking"
+```
+
+If **Yes**:
+1. Set `"performance_tracking": true` in `~/.claude/consensus.json` (add the field to the config written in Step 7)
+2. If `~/.claude/multi-model-performance.json` does not exist, create it:
+   ```json
+   {
+     "version": 1,
+     "runs": []
+   }
+   ```
+3. Report: "Performance tracking enabled. View results anytime with `/consensus:performance`."
+
+If **No**:
+1. Set `"performance_tracking": false` in `~/.claude/consensus.json`
+2. Report: "Performance tracking disabled. You can enable it later by re-running `/consensus-setup`."
+
 ## Step 9: Summary
 
 Print the final summary:
@@ -282,9 +312,13 @@ Print the final summary:
 
 **Quorum:** {min_quorum} of {total}
 
+**Performance tracking:** {enabled / disabled}
+
 **Next steps:**
 - `/code-review [target]` — multi-model code review
 - `/plan-review [task]` — multi-model plan review
+- `/review [target]` — multi-model document or general review
+- `/consensus:performance` — view model performance data (if tracking enabled)
 - `/consensus-setup` — reconfigure models anytime
 ```
 
diff --git a/plugins/consensus/commands/performance.md b/plugins/consensus/commands/performance.md
new file mode 100644
index 0000000..d3acd77
--- /dev/null
+++ b/plugins/consensus/commands/performance.md
@@ -0,0 +1,152 @@
+---
+name: performance
+description: "View consolidated multi-model performance data — identify which models contribute and which to prune"
+---
+
+# Multi-Model Performance Dashboard
+
+Analyze performance data from `~/.claude/multi-model-performance.json` to show which models contribute unique findings, which are redundant, and overall response reliability.
+
+## Input
+
+`$ARGUMENTS` = optional filter. Examples:
+- Empty — show full dashboard
+- `last 10` — show only the last 10 runs
+- `code-review` — filter to code-review runs only
+- `plan-review` — filter to plan-review runs only
+- `model gpt` — show detailed stats for a specific model
+
+## Step 1: Load Data
+
+Read `~/.claude/multi-model-performance.json`.
+
+If it does not exist or is empty:
+**ABORT**: "No performance data found. Enable tracking via `/consensus-setup` (Step 8.5) or ensure `performance_tracking` is `true` in `~/.claude/consensus.json`."
+
+Parse JSON. Extract `runs[]` array.
+
+If `$ARGUMENTS` contains a filter, apply it:
+- `last N` → take only the last N runs (by timestamp)
+- `code-review` / `plan-review` / `review` → filter by `command` field
+- `model {id}` → skip to Step 4 (single model deep dive)
+
+Report: `"Loaded {len(runs)} runs ({filtered count if filtered})."`
+
+## Step 2: Model Leaderboard
+
+Aggregate across all (filtered) runs. For each model that appears in any run:
+
+1. **Runs participated**: Count of runs where `responded == "yes"`
+2. **Response rate**: `runs_participated / total_runs * 100`
+3. **Unique findings**: Sum of `unique_findings` across all runs
+4. **Consensus findings**: Sum of `consensus_findings` across all runs
+5. **Total contributions**: `unique + consensus`
+6. **Unique rate**: `unique / total_contributions * 100` (how often this model catches things others miss)
+7. **Avg convergence**: Count APPROVE vs CHANGES NEEDED votes
+
+Present as a sorted table (sort by unique findings descending):
+
+```
+## Model Leaderboard ({N} runs)
+
+| Model | Runs | Response % | Unique | Consensus | Total | Unique % | Convergence |
+|-------|------|-----------|--------|-----------|-------|----------|-------------|
+| {model} | {N} | {pct}% | {N} | {N} | {N} | {pct}% | {N}A / {N}C |
+| ... | ... | ... | ... | ... | ... | ... | ... |
+
+A = APPROVE, C = CHANGES NEEDED
+```
+
+## Step 3: Actionable Insights
+
+Based on the leaderboard, generate insights:
+
+### Prune Candidates
+Models with **0 unique findings** across all runs — they never catch anything the other models miss.
+
+```
+### Prune Candidates (0 unique findings)
+- {model}: {N} runs, 0 unique findings, {N} consensus findings
+  → Consider disabling in /consensus-setup to save time and API costs
+```
+
+### MVPs
+Models with the **highest unique finding rate** — they consistently catch issues others miss.
+
+```
+### MVPs (highest unique contribution)
+- {model}: {unique_findings} unique findings ({unique_rate}% of their contributions are unique)
+```
+
+### Reliability Issues
+Models with **response rate < 90%** — they timeout or fail frequently.
+
+```
+### Reliability Issues (response rate < 90%)
+- {model}: {response_rate}% response rate ({failures} failures, {timeouts} timeouts)
+```
+
+### Contrarians
+Models that vote **CHANGES NEEDED** more than 50% of the time during convergence — they frequently disagree with the synthesis.
+
+```
+### Contrarians (>50% CHANGES NEEDED votes)
+- {model}: {changes_needed_count}/{total_votes} convergence rounds resulted in objections
+```
+
+## Step 4: Single Model Deep Dive (if `model {id}` filter)
+
+If the user requested a specific model, show detailed per-run history:
+
+```
+## {model_name} — Detailed Performance
+
+**Overall:** {runs} runs, {response_rate}% response rate, {unique} unique / {consensus} consensus findings
+
+### Run History
+| Run | Date | Command | Target | Unique | Consensus | Vote | Output Size |
+|-----|------|---------|--------|--------|-----------|------|-------------|
+| {id} | {date} | {cmd} | {target} | {N} | {N} | {vote} | {bytes} |
+| ... | ... | ... | ... | ... | ... | ... | ... |
+
+### Trends
+- Finding rate trend: {increasing / stable / decreasing}
+- Most common contribution type: {unique / consensus}
+- Average output size: {N} bytes
+```
+
+## Step 5: Recommendations
+
+Based on all data, provide a concrete recommendation:
+
+```
+## Recommendations
+
+**Current panel:** Claude + {N} models, quorum={quorum}
+**Suggested changes:**
+- {Disable {model} — 0 unique findings in {N} runs}
+- {Keep {model} — highest unique rate at {N}%}
+- {Monitor {model} — response rate dropping ({N}%)}
+- {Consider lowering quorum to {N} if you disable {N} models}
+```
+
+## Data Location
+
+Performance data is stored at: `~/.claude/multi-model-performance.json`
+
+Each run record contains:
+- `id` — session ID
+- `timestamp` — UTC ISO 8601
+- `command` — which consensus command was used
+- `target` — what was reviewed/planned
+- `review_type` — CODE_REVIEW, PLAN_REVIEW, DOCUMENT_REVIEW, or GENERAL_REVIEW
+- `total_findings` — total number of findings in the synthesized output
+- `models` — per-model breakdown with response status, findings, convergence vote, and output size
+
+## Rules
+
+1. **Read-only.** This command never modifies the performance JSON. It only reads and analyzes.
+2. **Handle sparse data.** If a model appears in some runs but not others, calculate rates based on runs where it was expected (present in the config at the time).
+3. **No minimum run requirement.** Show data even with 1 run, but note: "Limited data — {N} run(s). Recommendations improve with more data."
+4. **Be specific.** Name exact models, give exact numbers. No vague "some models are underperforming."
+5. **Actionable output.** Every insight must have a clear action: disable, keep, monitor, or reconfigure.
diff --git a/plugins/consensus/commands/plan-review.md b/plugins/consensus/commands/plan-review.md
index 302db98..86df9ca 100644
--- a/plugins/consensus/commands/plan-review.md
+++ b/plugins/consensus/commands/plan-review.md
@@ -452,6 +452,62 @@ If in plan mode, attempt to call `EnterPlanMode` so the user can review the fina
 
 On failure: preserve `$SESSION_DIR` for debugging and tell the user where files are.
 
+## Step 11: Record Performance Metrics (if enabled)
+
+Check the `performance_tracking` field from the config loaded in Step 0. If it is `true`, execute this step. If `false` or missing, skip entirely.
+
+### 11a: Gather Data
+
+For each model that participated (Claude + all models in `MODELS`):
+
+1. **Response status**: Check `$SESSION_DIR/{model-id}.md`
+   - Exists and > 0 bytes → `"yes"`
+   - Exists but 0 bytes → `"failed"`
+   - Does not exist → `"timeout"`
+
+2. **Output size**: `wc -c < $SESSION_DIR/{model-id}.md 2>/dev/null || echo 0`
+
+3. **Convergence vote**: Read `$SESSION_DIR/{model-id}-convergence.md`. Find first occurrence of `APPROVE` or `CHANGES NEEDED`. If file missing or neither found → `null`.
+
+4. **Attribution counts from `$SESSION_DIR/draft.md`**: Find all `_Flagged by:` or `_Proposed by:` lines. For each line:
+   - Parse model names from the attribution marker
+   - Normalize names to model IDs using config model names and IDs. Also: Claude→claude.
+   - If 1-2 models listed → each gets +1 `unique_findings`
+   - If 3+ models listed → each gets +1 `consensus_findings`
+   - Count total attribution lines as `total_findings`
+
+### 11b: Build Run Record
+
+```json
+{
+  "id": "{SESSION_ID from SESSION_DIR path suffix}",
+  "timestamp": "{current UTC ISO 8601}",
+  "command": "consensus:plan-review",
+  "target": "{short description of task}",
+  "review_type": "PLAN_REVIEW",
+  "total_findings": "{count}",
+  "models": {
+    "{model-id}": {
+      "responded": "yes|failed|timeout",
+      "unique_findings": "{N}",
+      "consensus_findings": "{N}",
+      "convergence_vote": "APPROVE|CHANGES NEEDED|null",
+      "output_bytes": "{N}"
+    }
+  }
+}
+```
+
+### 11c: Append to JSON
+
+1. Read `~/.claude/multi-model-performance.json`. If it does not exist, create it with `{"version": 1, "runs": []}`.
+2. Parse JSON
+3. Check if `id` already exists in `runs[]` — if so, skip (dedup)
+4. Append run record to `runs[]`
+5. Write updated JSON back
+
+Report: `"Performance tracked: {total_findings} findings across {M} models. {len(runs)} total runs recorded."`
+
 ## Rules
 
 1. **No model names in prompts.** Never mention any AI model name in prompts sent to external tools. Frame as first-person.
diff --git a/plugins/consensus/consensus.config.json b/plugins/consensus/consensus.config.json
index 1d764a0..22f3ba5 100644
--- a/plugins/consensus/consensus.config.json
+++ b/plugins/consensus/consensus.config.json
@@ -65,5 +65,6 @@
       "enabled": true
     }
   ],
-  "min_quorum": 6
+  "min_quorum": 6,
+  "performance_tracking": false
 }

From cbbf5076f7e4817c4c7c8f1ef986887ebeab380c Mon Sep 17 00:00:00 2001
From: Kulvir <kulvirgithub@gmail.com>
Date: Wed, 1 Apr 2026 15:03:43 -0700
Subject: [PATCH 2/2] fix: remove `--approval-mode plan` from gemini CLI
 invocations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 plugins/consensus/commands/code-review.md     | 4 ++--
 plugins/consensus/commands/consensus-setup.md | 2 +-
 plugins/consensus/commands/plan-review.md     | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/plugins/consensus/commands/code-review.md b/plugins/consensus/commands/code-review.md
index 3df79e7..b70118a 100644
--- a/plugins/consensus/commands/code-review.md
+++ b/plugins/consensus/commands/code-review.md
@@ -236,7 +236,7 @@ SESSION_DIR={SESSION_DIR}
    codex exec -s read-only {EXTRA_DIRS_FLAGS} -o $SESSION_DIR/{MODEL_ID}.md - < $SESSION_DIR/prompt.md
 
    **If `{MODEL_COMMAND}` starts with `gemini`:**
-   gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}.md 2>&1
+   gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" > $SESSION_DIR/{MODEL_ID}.md 2>&1
 
    **If `{MODEL_COMMAND}` starts with `qwen`:**
    qwen {EXTRA_DIRS_FLAGS} --approval-mode plan -p "$(cat $SESSION_DIR/prompt.md)" -o text > $SESSION_DIR/{MODEL_ID}.md 2>&1
@@ -262,7 +262,7 @@ After sending the review, WAIT. The lead will send you a convergence prompt. Whe
    codex exec resume --last - < $SESSION_DIR/convergence-prompt-{MODEL_ID}.md > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1
 
    **If `{MODEL_COMMAND}` starts with `gemini`:**
-   gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1
+   gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1
 
    **If `{MODEL_COMMAND}` starts with `qwen`:**
    qwen -c -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" -o text > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1
diff --git a/plugins/consensus/commands/consensus-setup.md b/plugins/consensus/commands/consensus-setup.md
index f520000..de42ba1 100644
--- a/plugins/consensus/commands/consensus-setup.md
+++ b/plugins/consensus/commands/consensus-setup.md
@@ -131,7 +131,7 @@ The 9 supported models and their CLI mappings:
 | `nemotron` | Nemotron 120B | `kilo run -m openrouter/nvidia/nemotron-3-super-120b-a12b --auto` | OpenRouter only |
 | `mimo` | MiMo V2 Pro | `kilo run -m openrouter/xiaomi/mimo-v2-pro --auto` | OpenRouter only |
 
-**Note on native CLIs**: For `codex`, `gemini`, and `qwen`, set the config's `command` field to just `codex`, `gemini`, or `qwen`. The teammate template in the review/plan commands detects these and uses the correct native invocation patterns automatically (e.g., `codex exec -s read-only` for reviews, `codex exec resume --last` for convergence, `gemini -p` with `--approval-mode plan` for reviews, `gemini --resume latest` for convergence, `qwen --approval-mode plan -p` with `-o text` for reviews, `qwen -c -p` for convergence). The `resume_flag` field is ignored for native CLIs.
+**Note on native CLIs**: For `codex`, `gemini`, and `qwen`, set the config's `command` field to just `codex`, `gemini`, or `qwen`. The teammate template in the review/plan commands detects these and uses the correct native invocation patterns automatically (e.g., `codex exec -s read-only` for reviews, `codex exec resume --last` for convergence, `gemini -p` for reviews, `gemini --resume latest` for convergence, `qwen --approval-mode plan -p` with `-o text` for reviews, `qwen -c -p` for convergence). The `resume_flag` field is ignored for native CLIs.
 
 Determine which models are available:
 - **OpenRouter path**: All 9 available if `kilo` installed + API key set
diff --git a/plugins/consensus/commands/plan-review.md b/plugins/consensus/commands/plan-review.md
index 86df9ca..c74c2fe 100644
--- a/plugins/consensus/commands/plan-review.md
+++ b/plugins/consensus/commands/plan-review.md
@@ -248,7 +248,7 @@ SESSION_DIR={SESSION_DIR}
    codex exec -s read-only {EXTRA_DIRS_FLAGS} -o $SESSION_DIR/{MODEL_ID}.md - < $SESSION_DIR/prompt.md
 
    **If `{MODEL_COMMAND}` starts with `gemini`:**
-   gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}.md 2>&1
+   gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" > $SESSION_DIR/{MODEL_ID}.md 2>&1
 
    **If `{MODEL_COMMAND}` starts with `qwen`:**
    qwen {EXTRA_DIRS_FLAGS} --approval-mode plan -p "$(cat $SESSION_DIR/prompt.md)" -o text > $SESSION_DIR/{MODEL_ID}.md 2>&1
@@ -274,7 +274,7 @@ After sending the plan, WAIT. The lead will send you a convergence prompt. When
    codex exec resume --last - < $SESSION_DIR/convergence-prompt-{MODEL_ID}.md > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1
 
    **If `{MODEL_COMMAND}` starts with `gemini`:**
-   gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1
+   gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1
 
    **If `{MODEL_COMMAND}` starts with `qwen`:**
    qwen -c -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" -o text > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1