Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
60 changes: 58 additions & 2 deletions plugins/consensus/commands/code-review.md
Original file line number Diff line number Diff line change
Expand Up @@ -236,7 +236,7 @@ SESSION_DIR={SESSION_DIR}
codex exec -s read-only {EXTRA_DIRS_FLAGS} -o $SESSION_DIR/{MODEL_ID}.md - < $SESSION_DIR/prompt.md

**If `{MODEL_COMMAND}` starts with `gemini`:**
gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}.md 2>&1
gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" > $SESSION_DIR/{MODEL_ID}.md 2>&1

**If `{MODEL_COMMAND}` starts with `qwen`:**
qwen {EXTRA_DIRS_FLAGS} --approval-mode plan -p "$(cat $SESSION_DIR/prompt.md)" -o text > $SESSION_DIR/{MODEL_ID}.md 2>&1
Expand All @@ -262,7 +262,7 @@ After sending the review, WAIT. The lead will send you a convergence prompt. Whe
codex exec resume --last - < $SESSION_DIR/convergence-prompt-{MODEL_ID}.md > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1

**If `{MODEL_COMMAND}` starts with `gemini`:**
gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1
gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1

**If `{MODEL_COMMAND}` starts with `qwen`:**
qwen -c -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" -o text > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1
Expand Down Expand Up @@ -469,6 +469,62 @@ TeamDelete

On failure: preserve `$SESSION_DIR` for debugging and tell the user where files are.

## Step 11: Record Performance Metrics (if enabled)

Check the `performance_tracking` field from the config loaded in Step 0. If it is `true`, execute this step. If `false` or missing, skip entirely.

### 11a: Gather Data

For each model that participated (Claude + all models in `MODELS`, + CodeRabbit if it ran):

1. **Response status**: Check `$SESSION_DIR/{model-id}.md`
- Exists and > 0 bytes → `"yes"`
- Exists but 0 bytes → `"failed"`
- Does not exist → `"timeout"`

2. **Output size**: `wc -c < $SESSION_DIR/{model-id}.md 2>/dev/null || echo 0`

3. **Convergence vote**: Read `$SESSION_DIR/{model-id}-convergence.md`. Find first occurrence of `APPROVE` or `CHANGES NEEDED`. If file missing or neither found → `null`. (CodeRabbit has no convergence vote — always `null`.)

4. **Attribution counts from `$SESSION_DIR/draft.md`**: Find all `_Flagged by:` lines. For each line:
- Parse model names between `_Flagged by:` and `_`
- Normalize names to model IDs using config model names and IDs. Also: Claude→claude, CodeRabbit→coderabbit.
- If 1-2 models listed → each gets +1 `unique_findings`
- If 3+ models listed → each gets +1 `consensus_findings`
- Count total `_Flagged by:` lines as `total_findings`

### 11b: Build Run Record

```json
{
"id": "{SESSION_ID from SESSION_DIR path suffix}",
"timestamp": "{current UTC ISO 8601}",
"command": "consensus:code-review",
"target": "{short description of review target}",
"review_type": "CODE_REVIEW",
"total_findings": "{count}",
"models": {
"{model-id}": {
"responded": "yes|failed|timeout",
"unique_findings": "{N}",
"consensus_findings": "{N}",
"convergence_vote": "APPROVE|CHANGES NEEDED|null",
"output_bytes": "{N}"
}
}
}
```

### 11c: Append to JSON

1. Read `~/.claude/multi-model-performance.json`. If it does not exist, create it with `{"version": 1, "runs": []}`.
2. Parse JSON
3. Check if `id` already exists in `runs[]` — if so, skip (dedup)
4. Append run record to `runs[]`
5. Write updated JSON back

Report: `"Performance tracked: {total_findings} findings across {M} models. {len(runs)} total runs recorded."`

## Rules

1. **No model names in prompts.** Never mention any AI model name in prompts sent to external tools. Frame as first-person.
Expand Down
36 changes: 35 additions & 1 deletion plugins/consensus/commands/consensus-setup.md
Original file line number Diff line number Diff line change
Expand Up @@ -131,7 +131,7 @@ The 9 supported models and their CLI mappings:
| `nemotron` | Nemotron 120B | `kilo run -m openrouter/nvidia/nemotron-3-super-120b-a12b --auto` | OpenRouter only |
| `mimo` | MiMo V2 Pro | `kilo run -m openrouter/xiaomi/mimo-v2-pro --auto` | OpenRouter only |

**Note on native CLIs**: For `codex`, `gemini`, and `qwen`, set the config's `command` field to just `codex`, `gemini`, or `qwen`. The teammate template in the review/plan commands detects these and uses the correct native invocation patterns automatically (e.g., `codex exec -s read-only` for reviews, `codex exec resume --last` for convergence, `gemini -p` with `--approval-mode plan` for reviews, `gemini --resume latest` for convergence, `qwen --approval-mode plan -p` with `-o text` for reviews, `qwen -c -p` for convergence). The `resume_flag` field is ignored for native CLIs.
**Note on native CLIs**: For `codex`, `gemini`, and `qwen`, set the config's `command` field to just `codex`, `gemini`, or `qwen`. The teammate template in the review/plan commands detects these and uses the correct native invocation patterns automatically (e.g., `codex exec -s read-only` for reviews, `codex exec resume --last` for convergence, `gemini -p` for reviews, `gemini --resume latest` for convergence, `qwen --approval-mode plan -p` with `-o text` for reviews, `qwen -c -p` for convergence). The `resume_flag` field is ignored for native CLIs.

Determine which models are available:
- **OpenRouter path**: All 9 available if `kilo` installed + API key set
Expand Down Expand Up @@ -268,6 +268,36 @@ Do NOT finalize the config in this case. Delete `~/.claude/consensus.json` and a

If quorum is still met after disabling failed models, rewrite `~/.claude/consensus.json` with updated enabled/disabled states.

## Step 8.5: Performance Tracking

Ask the user if they want to enable automatic performance tracking:

```
AskUserQuestion:
question: "Enable performance tracking? Records which models contribute unique findings so you can identify and prune underperformers."
header: "Performance Tracking"
options:
- label: "Yes (Recommended)"
description: "Append metrics to ~/.claude/multi-model-performance.json after each run. View with /consensus:performance."
- label: "No"
description: "Skip performance tracking"
```

If **Yes**:
1. Set `"performance_tracking": true` in `~/.claude/consensus.json` (add the field to the config written in Step 7)
2. If `~/.claude/multi-model-performance.json` does not exist, create it:
```json
{
"version": 1,
"runs": []
}
```
3. Report: "Performance tracking enabled. View results anytime with `/consensus:performance`."

If **No**:
1. Set `"performance_tracking": false` in `~/.claude/consensus.json`
2. Report: "Performance tracking disabled. You can enable it later by re-running `/consensus-setup`."

## Step 9: Summary

Print the final summary:
Expand All @@ -282,9 +312,13 @@ Print the final summary:

**Quorum:** {min_quorum} of {total}

**Performance tracking:** {enabled / disabled}

**Next steps:**
- `/code-review [target]` — multi-model code review
- `/plan-review [task]` — multi-model plan review
- `/review [target]` — multi-model document or general review
- `/consensus:performance` — view model performance data (if tracking enabled)
- `/consensus-setup` — reconfigure models anytime
```

Expand Down
152 changes: 152 additions & 0 deletions plugins/consensus/commands/performance.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,152 @@
---
name: performance
description: "View consolidated multi-model performance data — identify which models contribute and which to prune"
---

# Multi-Model Performance Dashboard

Analyze performance data from `~/.claude/multi-model-performance.json` to show which models contribute unique findings, which are redundant, and overall response reliability.

## Input

`$ARGUMENTS` = optional filter. Examples:
- Empty — show full dashboard
- `last 10` — show only the last 10 runs
- `code-review` — filter to code-review runs only
- `plan-review` — filter to plan-review runs only
- `model gpt` — show detailed stats for a specific model

## Step 1: Load Data

Read `~/.claude/multi-model-performance.json`.

If it does not exist or is empty:
**ABORT**: "No performance data found. Enable tracking via `/consensus-setup` (Step 8.5) or ensure `performance_tracking` is `true` in `~/.claude/consensus.json`."

Parse JSON. Extract `runs[]` array.

If `$ARGUMENTS` contains a filter, apply it:
- `last N` → take only the last N runs (by timestamp)
- `code-review` / `plan-review` / `review` → filter by `command` field
- `model {id}` → skip to Step 4 (single model deep dive)

Report: `"Loaded {len(runs)} runs ({filtered count if filtered})."`

## Step 2: Model Leaderboard

Aggregate across all (filtered) runs. For each model that appears in any run:

1. **Runs participated**: Count of runs where `responded == "yes"`
2. **Response rate**: `runs_participated / total_runs * 100`
3. **Unique findings**: Sum of `unique_findings` across all runs
4. **Consensus findings**: Sum of `consensus_findings` across all runs
5. **Total contributions**: `unique + consensus`
6. **Unique rate**: `unique / total_contributions * 100` (how often this model catches things others miss)
7. **Avg convergence**: Count APPROVE vs CHANGES NEEDED votes

Present as a sorted table (sort by unique findings descending):

```
## Model Leaderboard ({N} runs)

| Model | Runs | Response % | Unique | Consensus | Total | Unique % | Convergence |
|-------|------|-----------|--------|-----------|-------|----------|-------------|
| {model} | {N} | {pct}% | {N} | {N} | {N} | {pct}% | {N}A / {N}C |
| ... | ... | ... | ... | ... | ... | ... | ... |

A = APPROVE, C = CHANGES NEEDED
```

## Step 3: Actionable Insights

Based on the leaderboard, generate insights:

### Prune Candidates
Models with **0 unique findings** across all runs — they never catch anything the other models miss.

```
### Prune Candidates (0 unique findings)
- {model}: {N} runs, 0 unique findings, {N} consensus findings
→ Consider disabling in /consensus-setup to save time and API costs
```

### MVPs
Models with the **highest unique finding rate** — they consistently catch issues others miss.

```
### MVPs (highest unique contribution)
- {model}: {unique_findings} unique findings ({unique_rate}% of their contributions are unique)
```

### Reliability Issues
Models with **response rate < 90%** — they timeout or fail frequently.

```
### Reliability Issues (response rate < 90%)
- {model}: {response_rate}% response rate ({failures} failures, {timeouts} timeouts)
```

### Contrarians
Models that vote **CHANGES NEEDED** more than 50% of the time during convergence — they frequently disagree with the synthesis.

```
### Contrarians (>50% CHANGES NEEDED votes)
- {model}: {changes_needed_count}/{total_votes} convergence rounds resulted in objections
```

## Step 4: Single Model Deep Dive (if `model {id}` filter)

If the user requested a specific model, show detailed per-run history:

```
## {model_name} — Detailed Performance

**Overall:** {runs} runs, {response_rate}% response rate, {unique} unique / {consensus} consensus findings

### Run History
| Run | Date | Command | Target | Unique | Consensus | Vote | Output Size |
|-----|------|---------|--------|--------|-----------|------|-------------|
| {id} | {date} | {cmd} | {target} | {N} | {N} | {vote} | {bytes} |
| ... | ... | ... | ... | ... | ... | ... | ... |

### Trends
- Finding rate trend: {increasing / stable / decreasing}
- Most common contribution type: {unique / consensus}
- Average output size: {N} bytes
```

## Step 5: Recommendations

Based on all data, provide a concrete recommendation:

```
## Recommendations

**Current panel:** Claude + {N} models, quorum={quorum}
**Suggested changes:**
- {Disable {model} — 0 unique findings in {N} runs}
- {Keep {model} — highest unique rate at {N}%}
- {Monitor {model} — response rate dropping ({N}%)}
- {Consider lowering quorum to {N} if you disable {N} models}
```

## Data Location

Performance data is stored at: `~/.claude/multi-model-performance.json`

Each run record contains:
- `id` — session ID
- `timestamp` — UTC ISO 8601
- `command` — which consensus command was used
- `target` — what was reviewed/planned
- `review_type` — CODE_REVIEW, PLAN_REVIEW, DOCUMENT_REVIEW, or GENERAL_REVIEW
- `total_findings` — total number of findings in the synthesized output
- `models` — per-model breakdown with response status, findings, convergence vote, and output size

## Rules

1. **Read-only.** This command never modifies the performance JSON. It only reads and analyzes.
2. **Handle sparse data.** If a model appears in some runs but not others, calculate rates based on runs where it was expected (present in the config at the time).
3. **No minimum run requirement.** Show data even with 1 run, but note: "Limited data — {N} run(s). Recommendations improve with more data."
4. **Be specific.** Name exact models, give exact numbers. No vague "some models are underperforming."
5. **Actionable output.** Every insight must have a clear action: disable, keep, monitor, or reconfigure.
60 changes: 58 additions & 2 deletions plugins/consensus/commands/plan-review.md
Original file line number Diff line number Diff line change
Expand Up @@ -248,7 +248,7 @@ SESSION_DIR={SESSION_DIR}
codex exec -s read-only {EXTRA_DIRS_FLAGS} -o $SESSION_DIR/{MODEL_ID}.md - < $SESSION_DIR/prompt.md

**If `{MODEL_COMMAND}` starts with `gemini`:**
gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}.md 2>&1
gemini {EXTRA_DIRS_FLAGS} -p "$(cat $SESSION_DIR/prompt.md)" > $SESSION_DIR/{MODEL_ID}.md 2>&1

**If `{MODEL_COMMAND}` starts with `qwen`:**
qwen {EXTRA_DIRS_FLAGS} --approval-mode plan -p "$(cat $SESSION_DIR/prompt.md)" -o text > $SESSION_DIR/{MODEL_ID}.md 2>&1
Expand All @@ -274,7 +274,7 @@ After sending the plan, WAIT. The lead will send you a convergence prompt. When
codex exec resume --last - < $SESSION_DIR/convergence-prompt-{MODEL_ID}.md > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1

**If `{MODEL_COMMAND}` starts with `gemini`:**
gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" --approval-mode plan > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1
gemini --resume latest -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1

**If `{MODEL_COMMAND}` starts with `qwen`:**
qwen -c -p "$(cat $SESSION_DIR/convergence-prompt-{MODEL_ID}.md)" -o text > $SESSION_DIR/{MODEL_ID}-convergence.md 2>&1
Expand Down Expand Up @@ -452,6 +452,62 @@ If in plan mode, attempt to call `EnterPlanMode` so the user can review the fina

On failure: preserve `$SESSION_DIR` for debugging and tell the user where files are.

## Step 11: Record Performance Metrics (if enabled)

Check the `performance_tracking` field from the config loaded in Step 0. If it is `true`, execute this step. If `false` or missing, skip entirely.

### 11a: Gather Data

For each model that participated (Claude + all models in `MODELS`):

1. **Response status**: Check `$SESSION_DIR/{model-id}.md`
- Exists and > 0 bytes → `"yes"`
- Exists but 0 bytes → `"failed"`
- Does not exist → `"timeout"`

2. **Output size**: `wc -c < $SESSION_DIR/{model-id}.md 2>/dev/null || echo 0`

3. **Convergence vote**: Read `$SESSION_DIR/{model-id}-convergence.md`. Find first occurrence of `APPROVE` or `CHANGES NEEDED`. If file missing or neither found → `null`.

4. **Attribution counts from `$SESSION_DIR/draft.md`**: Find all `_Flagged by:` or `_Proposed by:` lines. For each line:
- Parse model names from the attribution marker
- Normalize names to model IDs using config model names and IDs. Also: Claude→claude.
- If 1-2 models listed → each gets +1 `unique_findings`
- If 3+ models listed → each gets +1 `consensus_findings`
- Count total attribution lines as `total_findings`

### 11b: Build Run Record

```json
{
"id": "{SESSION_ID from SESSION_DIR path suffix}",
"timestamp": "{current UTC ISO 8601}",
"command": "consensus:plan-review",
"target": "{short description of task}",
"review_type": "PLAN_REVIEW",
"total_findings": "{count}",
"models": {
"{model-id}": {
"responded": "yes|failed|timeout",
"unique_findings": "{N}",
"consensus_findings": "{N}",
"convergence_vote": "APPROVE|CHANGES NEEDED|null",
"output_bytes": "{N}"
}
}
}
```

### 11c: Append to JSON

1. Read `~/.claude/multi-model-performance.json`. If it does not exist, create it with `{"version": 1, "runs": []}`.
2. Parse JSON
3. Check if `id` already exists in `runs[]` — if so, skip (dedup)
4. Append run record to `runs[]`
5. Write updated JSON back

Report: `"Performance tracked: {total_findings} findings across {M} models. {len(runs)} total runs recorded."`

## Rules

1. **No model names in prompts.** Never mention any AI model name in prompts sent to external tools. Frame as first-person.
Expand Down
3 changes: 2 additions & 1 deletion plugins/consensus/consensus.config.json
Original file line number Diff line number Diff line change
Expand Up @@ -65,5 +65,6 @@
"enabled": true
}
],
"min_quorum": 6
"min_quorum": 6,
"performance_tracking": false
}