From 3c7b17718cf5845b757f7ac10e4ee5f029a71d91 Mon Sep 17 00:00:00 2001 From: Kulvir Date: Mon, 30 Mar 2026 16:20:51 -0700 Subject: [PATCH 1/2] feat: add built-in performance tracking with config flag and dashboard Bake Step 11 (performance metrics) directly into code-review and plan-review commands, gated by `performance_tracking` config flag. Changes: - Add `performance_tracking: false` to default config schema - Add Step 11 to code-review and plan-review (records per-model response rates, unique/consensus findings, convergence votes) - Add Step 8.5 to consensus-setup: opt-in prompt for performance tracking during wizard, creates JSON file if enabled - Add `/consensus:performance` command: model leaderboard, prune candidates, MVPs, reliability issues, single-model deep dive - Data stored at `~/.claude/multi-model-performance.json` Co-Authored-By: Claude Opus 4.6 (1M context) --- plugins/consensus/commands/code-review.md | 56 +++++++ plugins/consensus/commands/consensus-setup.md | 34 ++++ plugins/consensus/commands/performance.md | 152 ++++++++++++++++++ plugins/consensus/commands/plan-review.md | 56 +++++++ plugins/consensus/consensus.config.json | 3 +- 5 files changed, 300 insertions(+), 1 deletion(-) create mode 100644 plugins/consensus/commands/performance.md diff --git a/plugins/consensus/commands/code-review.md b/plugins/consensus/commands/code-review.md index ada04e8..3df79e7 100644 --- a/plugins/consensus/commands/code-review.md +++ b/plugins/consensus/commands/code-review.md @@ -469,6 +469,62 @@ TeamDelete On failure: preserve `$SESSION_DIR` for debugging and tell the user where files are. +## Step 11: Record Performance Metrics (if enabled) + +Check the `performance_tracking` field from the config loaded in Step 0. If it is `true`, execute this step. If `false` or missing, skip entirely. + +### 11a: Gather Data + +For each model that participated (Claude + all models in `MODELS`, + CodeRabbit if it ran): + +1. **Response status**: Check `$SESSION_DIR/{model-id}.md` + - Exists and > 0 bytes → `"yes"` + - Exists but 0 bytes → `"failed"` + - Does not exist → `"timeout"` + +2. **Output size**: `wc -c < $SESSION_DIR/{model-id}.md 2>/dev/null || echo 0` + +3. **Convergence vote**: Read `$SESSION_DIR/{model-id}-convergence.md`. Find first occurrence of `APPROVE` or `CHANGES NEEDED`. If file missing or neither found → `null`. (CodeRabbit has no convergence vote — always `null`.) + +4. **Attribution counts from `$SESSION_DIR/draft.md`**: Find all `_Flagged by:` lines. For each line: + - Parse model names between `_Flagged by:` and `_` + - Normalize names to model IDs using config model names and IDs. Also: Claude→claude, CodeRabbit→coderabbit. + - If 1-2 models listed → each gets +1 `unique_findings` + - If 3+ models listed → each gets +1 `consensus_findings` + - Count total `_Flagged by:` lines as `total_findings` + +### 11b: Build Run Record + +```json +{ + "id": "{SESSION_ID from SESSION_DIR path suffix}", + "timestamp": "{current UTC ISO 8601}", + "command": "consensus:code-review", + "target": "{short description of review target}", + "review_type": "CODE_REVIEW", + "total_findings": "{count}", + "models": { + "{model-id}": { + "responded": "yes|failed|timeout", + "unique_findings": "{N}", + "consensus_findings": "{N}", + "convergence_vote": "APPROVE|CHANGES NEEDED|null", + "output_bytes": "{N}" + } + } +} +``` + +### 11c: Append to JSON + +1. Read `~/.claude/multi-model-performance.json`. If it does not exist, create it with `{"version": 1, "runs": []}`. +2. Parse JSON +3. Check if `id` already exists in `runs[]` — if so, skip (dedup) +4. Append run record to `runs[]` +5. Write updated JSON back + +Report: `"Performance tracked: {total_findings} findings across {M} models. {len(runs)} total runs recorded."` + ## Rules 1. **No model names in prompts.** Never mention any AI model name in prompts sent to external tools. Frame as first-person. diff --git a/plugins/consensus/commands/consensus-setup.md b/plugins/consensus/commands/consensus-setup.md index 6f219d9..f520000 100644 --- a/plugins/consensus/commands/consensus-setup.md +++ b/plugins/consensus/commands/consensus-setup.md @@ -268,6 +268,36 @@ Do NOT finalize the config in this case. Delete `~/.claude/consensus.json` and a If quorum is still met after disabling failed models, rewrite `~/.claude/consensus.json` with updated enabled/disabled states. +## Step 8.5: Performance Tracking + +Ask the user if they want to enable automatic performance tracking: + +``` +AskUserQuestion: + question: "Enable performance tracking? Records which models contribute unique findings so you can identify and prune underperformers." + header: "Performance Tracking" + options: + - label: "Yes (Recommended)" + description: "Append metrics to ~/.claude/multi-model-performance.json after each run. View with /consensus:performance." + - label: "No" + description: "Skip performance tracking" +``` + +If **Yes**: +1. Set `"performance_tracking": true` in `~/.claude/consensus.json` (add the field to the config written in Step 7) +2. If `~/.claude/multi-model-performance.json` does not exist, create it: + ```json + { + "version": 1, + "runs": [] + } + ``` +3. Report: "Performance tracking enabled. View results anytime with `/consensus:performance`." + +If **No**: +1. Set `"performance_tracking": false` in `~/.claude/consensus.json` +2. Report: "Performance tracking disabled. You can enable it later by re-running `/consensus-setup`." + ## Step 9: Summary Print the final summary: @@ -282,9 +312,13 @@ Print the final summary: **Quorum:** {min_quorum} of {total} +**Performance tracking:** {enabled / disabled} + **Next steps:** - `/code-review [target]` — multi-model code review - `/plan-review [task]` — multi-model plan review +- `/review [target]` — multi-model document or general review +- `/consensus:performance` — view model performance data (if tracking enabled) - `/consensus-setup` — reconfigure models anytime ``` diff --git a/plugins/consensus/commands/performance.md b/plugins/consensus/commands/performance.md new file mode 100644 index 0000000..d3acd77 --- /dev/null +++ b/plugins/consensus/commands/performance.md @@ -0,0 +1,152 @@ +--- +name: performance +description: "View consolidated multi-model performance data — identify which models contribute and which to prune" +--- + +# Multi-Model Performance Dashboard + +Analyze performance data from `~/.claude/multi-model-performance.json` to show which models contribute unique findings, which are redundant, and overall response reliability. + +## Input + +`$ARGUMENTS` = optional filter. Examples: +- Empty — show full dashboard +- `last 10` — show only the last 10 runs +- `code-review` — filter to code-review runs only +- `plan-review` — filter to plan-review runs only +- `model gpt` — show detailed stats for a specific model + +## Step 1: Load Data + +Read `~/.claude/multi-model-performance.json`. + +If it does not exist or is empty: +**ABORT**: "No performance data found. Enable tracking via `/consensus-setup` (Step 8.5) or ensure `performance_tracking` is `true` in `~/.claude/consensus.json`." + +Parse JSON. Extract `runs[]` array. + +If `$ARGUMENTS` contains a filter, apply it: +- `last N` → take only the last N runs (by timestamp) +- `code-review` / `plan-review` / `review` → filter by `command` field +- `model {id}` → skip to Step 4 (single model deep dive) + +Report: `"Loaded {len(runs)} runs ({filtered count if filtered})."` + +## Step 2: Model Leaderboard + +Aggregate across all (filtered) runs. For each model that appears in any run: + +1. **Runs participated**: Count of runs where `responded == "yes"` +2. **Response rate**: `runs_participated / total_runs * 100` +3. **Unique findings**: Sum of `unique_findings` across all runs +4. **Consensus findings**: Sum of `consensus_findings` across all runs +5. **Total contributions**: `unique + consensus` +6. **Unique rate**: `unique / total_contributions * 100` (how often this model catches things others miss) +7. **Avg convergence**: Count APPROVE vs CHANGES NEEDED votes + +Present as a sorted table (sort by unique findings descending): + +``` +## Model Leaderboard ({N} runs) + +| Model | Runs | Response % | Unique | Consensus | Total | Unique % | Convergence | +|-------|------|-----------|--------|-----------|-------|----------|-------------| +| {model} | {N} | {pct}% | {N} | {N} | {N} | {pct}% | {N}A / {N}C | +| ... | ... | ... | ... | ... | ... | ... | ... | + +A = APPROVE, C = CHANGES NEEDED +``` + +## Step 3: Actionable Insights + +Based on the leaderboard, generate insights: + +### Prune Candidates +Models with **0 unique findings** across all runs — they never catch anything the other models miss. + +``` +### Prune Candidates (0 unique findings) +- {model}: {N} runs, 0 unique findings, {N} consensus findings + → Consider disabling in /consensus-setup to save time and API costs +``` + +### MVPs +Models with the **highest unique finding rate** — they consistently catch issues others miss. + +``` +### MVPs (highest unique contribution) +- {model}: {unique_findings} unique findings ({unique_rate}% of their contributions are unique) +``` + +### Reliability Issues +Models with **response rate < 90%** — they timeout or fail frequently. + +``` +### Reliability Issues (response rate < 90%) +- {model}: {response_rate}% response rate ({failures} failures, {timeouts} timeouts) +``` + +### Contrarians +Models that vote **CHANGES NEEDED** more than 50% of the time during convergence — they frequently disagree with the synthesis. + +``` +### Contrarians (>50% CHANGES NEEDED votes) +- {model}: {changes_needed_count}/{total_votes} convergence rounds resulted in objections +``` + +## Step 4: Single Model Deep Dive (if `model {id}` filter) + +If the user requested a specific model, show detailed per-run history: + +``` +## {model_name} — Detailed Performance + +**Overall:** {runs} runs, {response_rate}% response rate, {unique} unique / {consensus} consensus findings + +### Run History +| Run | Date | Command | Target | Unique | Consensus | Vote | Output Size | +|-----|------|---------|--------|--------|-----------|------|-------------| +| {id} | {date} | {cmd} | {target} | {N} | {N} | {vote} | {bytes} | +| ... | ... | ... | ... | ... | ... | ... | ... | + +### Trends +- Finding rate trend: {increasing / stable / decreasing} +- Most common contribution type: {unique / consensus} +- Average output size: {N} bytes +``` + +## Step 5: Recommendations + +Based on all data, provide a concrete recommendation: + +``` +## Recommendations + +**Current panel:** Claude + {N} models, quorum={quorum} +**Suggested changes:** +- {Disable {model} — 0 unique findings in {N} runs} +- {Keep {model} — highest unique rate at {N}%} +- {Monitor {model} — response rate dropping ({N}%)} +- {Consider lowering quorum to {N} if you disable {N} models} +``` + +## Data Location + +Performance data is stored at: `~/.claude/multi-model-performance.json` + +Each run record contains: +- `id` — session ID +- `timestamp` — UTC ISO 8601 +- `command` — which consensus command was used +- `target` — what was reviewed/planned +- `review_type` — CODE_REVIEW, PLAN_REVIEW, DOCUMENT_REVIEW, or GENERAL_REVIEW +- `total_findings` — total number of findings in the synthesized output +- `models` — per-model breakdown with response status, findings, convergence vote, and output size + +## Rules + +1. **Read-only.** This command never modifies the performance JSON. It only reads and analyzes. +2. **Handle sparse data.** If a model appears in some runs but not others, calculate rates based on runs where it was expected (present in the config at the time). +3. **No minimum run requirement.** Show data even with 1 run, but note: "Limited data — {N} run(s). Recommendations improve with more data." +4. **Be specific.** Name exact models, give exact numbers. No vague "some models are underperforming." +5. **Actionable output.** Every insight must have a clear action: disable, keep, monitor, or reconfigure. diff --git a/plugins/consensus/commands/plan-review.md b/plugins/consensus/commands/plan-review.md index 302db98..86df9ca 100644 --- a/plugins/consensus/commands/plan-review.md +++ b/plugins/consensus/commands/plan-review.md @@ -452,6 +452,62 @@ If in plan mode, attempt to call `EnterPlanMode` so the user can review the fina On failure: preserve `$SESSION_DIR` for debugging and tell the user where files are. +## Step 11: Record Performance Metrics (if enabled) + +Check the `performance_tracking` field from the config loaded in Step 0. If it is `true`, execute this step. If `false` or missing, skip entirely. + +### 11a: Gather Data + +For each model that participated (Claude + all models in `MODELS`): + +1. **Response status**: Check `$SESSION_DIR/{model-id}.md` + - Exists and > 0 bytes → `"yes"` + - Exists but 0 bytes → `"failed"` + - Does not exist → `"timeout"` + +2. **Output size**: `wc -c < $SESSION_DIR/{model-id}.md 2>/dev/null || echo 0` + +3. **Convergence vote**: Read `$SESSION_DIR/{model-id}-convergence.md`. Find first occurrence of `APPROVE` or `CHANGES NEEDED`. If file missing or neither found → `null`. + +4. **Attribution counts from `$SESSION_DIR/draft.md`**: Find all `_Flagged by:` or `_Proposed by:` lines. For each line: + - Parse model names from the attribution marker + - Normalize names to model IDs using config model names and IDs. Also: Claude→claude. + - If 1-2 models listed → each gets +1 `unique_findings` + - If 3+ models listed → each gets +1 `consensus_findings` + - Count total attribution lines as `total_findings` + +### 11b: Build Run Record + +```json +{ + "id": "{SESSION_ID from SESSION_DIR path suffix}", + "timestamp": "{current UTC ISO 8601}", + "command": "consensus:plan-review", + "target": "{short description of task}", + "review_type": "PLAN_REVIEW", + "total_findings": "{count}", + "models": { + "{model-id}": { + "responded": "yes|failed|timeout", + "unique_findings": "{N}", + "consensus_findings": "{N}", + "convergence_vote": "APPROVE|CHANGES NEEDED|null", + "output_bytes": "{N}" + } + } +} +``` + +### 11c: Append to JSON + +1. Read `~/.claude/multi-model-performance.json`. If it does not exist, create it with `{"version": 1, "runs": []}`. +2. Parse JSON +3. Check if `id` already exists in `runs[]` — if so, skip (dedup) +4. Append run record to `runs[]` +5. Write updated JSON back + +Report: `"Performance tracked: {total_findings} findings across {M} models. {len(runs)} total runs recorded."` + ## Rules 1. **No model names in prompts.** Never mention any AI model name in prompts sent to external tools. Frame as first-person. diff --git a/plugins/consensus/consensus.config.json b/plugins/consensus/consensus.config.json index 1d764a0..22f3ba5 100644 --- a/plugins/consensus/consensus.config.json +++ b/plugins/consensus/consensus.config.json @@ -65,5 +65,6 @@ "enabled": true } ], - "min_quorum": 6 + "min_quorum": 6, + "performance_tracking": false } From cbbf5076f7e4817c4c7c8f1ef986887ebeab380c Mon Sep 17 00:00:00 2001 From: Kulvir Date: Wed, 1 Apr 2026 15:03:43 -0700 Subject: [PATCH 2/2] fix: remove `--approval-mode plan` from gemini CLI invocations Co-Authored-By: Claude Opus 4.6 (1M context) --- plugins/consensus/commands/code-review.md | 4 ++-- plugins/consensus/commands/consensus-setup.md | 2 +- plugins/consensus/commands/plan-review.md | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/plugins/consensus/commands/code-review.md b/plugins/consensus/commands/code-review.md index 3df79e7..b70118a 100644 --- a/plugins/consensus/commands/code-review.md +++ b/plugins/consensus/commands/code-review.md @@ -236,7 +236,7 @@ SESSION_DIR={SESSION_DIR} codex exec -s read-only {EXTRA_DIRS_FLAGS} -o $SESSION_DIR/{MODEL_ID}.md - < $SESSION_DIR/prompt.md **If `{MODEL_COMMAND}` starts with `gemini`:** - gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}.md 2>&1 + gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" > $SESSION_DIR/{MODEL_ID}.md 2>&1 **If `{MODEL_COMMAND}` starts with `qwen`:** qwen {EXTRA_DIRS_FLAGS} --approval-mode plan -p "$(cat $SESSION_DIR/prompt.md)" -o text > $SESSION_DIR/{MODEL_ID}.md 2>&1 @@ -262,7 +262,7 @@ After sending the review, WAIT. The lead will send you a convergence prompt. Whe codex exec resume --last - < $SESSION_DIR/convergence-prompt-{MODEL_ID}.md > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1 **If `{MODEL_COMMAND}` starts with `gemini`:** - gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1 + gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1 **If `{MODEL_COMMAND}` starts with `qwen`:** qwen -c -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" -o text > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1 diff --git a/plugins/consensus/commands/consensus-setup.md b/plugins/consensus/commands/consensus-setup.md index f520000..de42ba1 100644 --- a/plugins/consensus/commands/consensus-setup.md +++ b/plugins/consensus/commands/consensus-setup.md @@ -131,7 +131,7 @@ The 9 supported models and their CLI mappings: | `nemotron` | Nemotron 120B | `kilo run -m openrouter/nvidia/nemotron-3-super-120b-a12b --auto` | OpenRouter only | | `mimo` | MiMo V2 Pro | `kilo run -m openrouter/xiaomi/mimo-v2-pro --auto` | OpenRouter only | -**Note on native CLIs**: For `codex`, `gemini`, and `qwen`, set the config's `command` field to just `codex`, `gemini`, or `qwen`. The teammate template in the review/plan commands detects these and uses the correct native invocation patterns automatically (e.g., `codex exec -s read-only` for reviews, `codex exec resume --last` for convergence, `gemini -p` with `--approval-mode plan` for reviews, `gemini --resume latest` for convergence, `qwen --approval-mode plan -p` with `-o text` for reviews, `qwen -c -p` for convergence). The `resume_flag` field is ignored for native CLIs. +**Note on native CLIs**: For `codex`, `gemini`, and `qwen`, set the config's `command` field to just `codex`, `gemini`, or `qwen`. The teammate template in the review/plan commands detects these and uses the correct native invocation patterns automatically (e.g., `codex exec -s read-only` for reviews, `codex exec resume --last` for convergence, `gemini -p` for reviews, `gemini --resume latest` for convergence, `qwen --approval-mode plan -p` with `-o text` for reviews, `qwen -c -p` for convergence). The `resume_flag` field is ignored for native CLIs. Determine which models are available: - **OpenRouter path**: All 9 available if `kilo` installed + API key set diff --git a/plugins/consensus/commands/plan-review.md b/plugins/consensus/commands/plan-review.md index 86df9ca..c74c2fe 100644 --- a/plugins/consensus/commands/plan-review.md +++ b/plugins/consensus/commands/plan-review.md @@ -248,7 +248,7 @@ SESSION_DIR={SESSION_DIR} codex exec -s read-only {EXTRA_DIRS_FLAGS} -o $SESSION_DIR/{MODEL_ID}.md - < $SESSION_DIR/prompt.md **If `{MODEL_COMMAND}` starts with `gemini`:** - gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}.md 2>&1 + gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" > $SESSION_DIR/{MODEL_ID}.md 2>&1 **If `{MODEL_COMMAND}` starts with `qwen`:** qwen {EXTRA_DIRS_FLAGS} --approval-mode plan -p "$(cat $SESSION_DIR/prompt.md)" -o text > $SESSION_DIR/{MODEL_ID}.md 2>&1 @@ -274,7 +274,7 @@ After sending the plan, WAIT. The lead will send you a convergence prompt. When codex exec resume --last - < $SESSION_DIR/convergence-prompt-{MODEL_ID}.md > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1 **If `{MODEL_COMMAND}` starts with `gemini`:** - gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1 + gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1 **If `{MODEL_COMMAND}` starts with `qwen`:** qwen -c -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" -o text > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1