Where LLMs Debate Your Code
Multiple LLMs review your code in parallel, debate conflicting opinions, then a head agent delivers the final verdict. Different models catch different bugs — consensus filters the noise.
npm i -g @codeagora/review
agora init
git diff | agora review

Package line note: codeagora@2.x is now the legacy package line. Review-focused releases restart as @codeagora/review@0.x while keeping the codeagora and agora CLI binaries.
agora init auto-detects your API keys and CLI tools, then generates a config.
| Provider | Type | Cost |
|---|---|---|
| Groq | API | Free |
| Anthropic | API | Paid |
| Claude Code | CLI | Subscription |
| Gemini CLI | CLI | Free |
| Codex CLI | CLI | Subscription |
Full provider list (24+ API, 12 CLI) ->
git diff | agora review
Pre    --- Semantic Diff Classification
       --- TypeScript Diagnostics
       --- Change Impact Analysis
         |
L1     --- Reviewer A (security) --+
       --- Reviewer B (logic)    --+-- parallel specialist reviews
       --- Reviewer C (general)  --+
         |
Filter --- Hallucination Check (file/line validation)
       --- Self-contradiction Filter
       --- Evidence Dedup
         |
L2     --- Adversarial Discussion (supporters must disprove)
       --- Static analysis evidence in debate
         |
L3     --- Head Agent --> ACCEPT / REJECT / NEEDS_HUMAN
         |
Output --- Triage: N must-fix / N verify / N ignore
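For agents and scripts that consume the verdict, the sketch below shows roughly what the pipeline produces as a TypeScript shape. The type and field names here are illustrative assumptions, not the published schema; the Agent Contract doc defines the stable JSON output.

```ts
// Illustrative sketch only: these names are assumptions, not the published
// output schema (see the Agent Contract doc for the stable JSON contract).
type Verdict = "ACCEPT" | "REJECT" | "NEEDS_HUMAN"; // L3 head-agent decision
type Triage = "must-fix" | "verify" | "ignore";     // final triage buckets

interface Finding {
  file: string;        // validated against the diff by the hallucination check
  line: number;
  severity: "CRITICAL" | "WARNING" | "INFO";
  reviewerId: string;  // which L1 specialist raised it
  triage: Triage;      // assigned after the L2 adversarial discussion
  summary: string;
}

interface ReviewResult {
  verdict: Verdict;
  findings: Finding[];
  degraded?: { reason: string }; // present when a run was degraded or skipped
}
```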
The old web dashboard and terminal TUI are being consolidated toward a planned cross-platform Tauri desktop app.
The CLI remains the primary automation surface for LLM agents and CI. The desktop app is intended to become the human-facing local UI for review history, configuration, progress, costs, and result exploration, but it is not part of the stable support surface yet.
An initial private preview scaffold lives in packages/desktop while the desktop MVP takes shape.
9-tool MCP server for AI IDE integration.
// claude_desktop_config.json or .cursor/mcp.json
{
  "mcpServers": {
    "codeagora": {
      "command": "npx",
      "args": ["-y", "@codeagora/mcp"]
    }
  }
}

Tools: review_quick, review_full, review_pr, dry_run, explain_session, get_leaderboard, get_stats, config_get, config_set.
Package-local MCP onboarding: packages/mcp/README.md.
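If you want to drive the MCP server from a script rather than an IDE, a minimal client sketch looks like the following. It assumes the @modelcontextprotocol/sdk client API and launches the server with the same npx command as the config above; the review_quick argument shape is a guess, so check packages/mcp/README.md for the real input schema.

```ts
// Minimal sketch of driving the CodeAgora MCP server from a script instead of an IDE.
// Assumes the @modelcontextprotocol/sdk client API; the review_quick argument shape
// shown here is a guess -- check packages/mcp/README.md for the real input schema.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main() {
  // Same launch command the IDE config above uses.
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["-y", "@codeagora/mcp"],
  });
  const client = new Client({ name: "codeagora-example", version: "0.0.1" });
  await client.connect(transport);

  // Discover the nine tools and their input schemas.
  const { tools } = await client.listTools();
  console.log(tools.map((t) => t.name));

  // Hypothetical call -- the actual review_quick parameters may differ.
  const result = await client.callTool({
    name: "review_quick",
    arguments: { diff: "…" },
  });
  console.log(result);

  await client.close();
}

main().catch(console.error);
```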
All extensions are optional — install only what you need.
| Package | Install | What it does |
|---|---|---|
| @codeagora/mcp | npm i -g @codeagora/mcp | MCP server (9 tools) — integrates with Claude Code, Cursor, and any MCP-compatible IDE |
The core codeagora CLI includes everything needed for command-line reviews and GitHub Actions. Human-facing UI work is moving into the desktop app.
Add CodeAgora to any repo in 3 steps:
1. Create .ca/config.json (or run agora init):
{
  "mode": "pragmatic",
  "reviewers": [
    { "id": "r1", "model": "llama-3.3-70b-versatile", "backend": "api", "provider": "groq", "enabled": true, "timeout": 120 },
    { "id": "r2", "model": "qwen/qwen3-32b", "backend": "api", "provider": "groq", "enabled": true, "timeout": 120 },
    { "id": "r3", "model": "meta-llama/llama-4-scout-17b-16e-instruct", "backend": "api", "provider": "groq", "enabled": true, "timeout": 120 }
  ]
}

2. Add the workflow (.github/workflows/codeagora-review.yml):
name: CodeAgora Review
on:
  pull_request:
    types: [opened, synchronize, reopened]
permissions:
  contents: read
  pull-requests: write
  statuses: write
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: bssm-oss/CodeAgora@v0.1.0-beta.1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}

3. Add GROQ_API_KEY to your repo's Settings > Secrets > Actions.
Every PR gets inline review comments, a summary verdict, and a commit status check. Add review:skip label to any PR to bypass.
Note:
- Use the v0.1.0-beta.1 Action tag for the scoped beta package line; v2 belongs to the legacy codeagora@2.x line.
- The Action defaults to .ca/config.json in the repo root. You can override this via the config-path input (a CLI flag wins, then the CONFIG_PATH env).
- fail-on-reject defaults to true, so the Action exits with code 1 on REJECT (a wrapper sketch that checks this exit code follows below).
- The review:skip label is caller-owned and will not be modified by the Action.
- Any degraded/skipped run surfaces the degraded and degraded-reason outputs.
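Outside the Action, the same REJECT gate can be reproduced in a custom CI step. The sketch below is only grounded in the "exit code 1 on REJECT" behavior noted above; the base ref and the meaning of other exit codes are assumptions, so consult the Agent Contract and Troubleshooting docs before relying on them.

```ts
// Rough sketch of wiring the CLI into a custom CI step outside the GitHub Action.
// Only the "exit code 1 on REJECT" behavior is documented above; treat the rest of
// the exit-code mapping and the base ref as assumptions.
import { execSync, spawnSync } from "node:child_process";

const diff = execSync("git diff origin/main...HEAD", { encoding: "utf8" });

const run = spawnSync("agora", ["review"], { input: diff, encoding: "utf8" });
process.stdout.write(run.stdout ?? "");

if (run.status === 0) {
  console.log("CodeAgora verdict: not a REJECT, continuing.");
} else if (run.status === 1) {
  // Mirrors the Action's fail-on-reject behavior.
  console.error("CodeAgora verdict: REJECT, failing the build.");
  process.exit(1);
} else {
  console.error(`agora review exited with code ${run.status}; see the Troubleshooting doc.`);
  process.exit(run.status ?? 1);
}
```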
| Doc | Content |
|---|---|
| CLI Reference | All commands and options |
| Configuration | Config file guide |
| Providers | Full provider list with tiers |
| Architecture | Pipeline design and project structure |
| Extensions | MCP and desktop direction |
| Product Surface Plan | Current surfaces and lightweight roadmap |
| Production Readiness Roadmap | Gates for production-ready CLI, GitHub Action, and MCP releases |
| Beta Readiness | P4-P6 beta gates, smoke checks, and release guardrails |
| Agent Contract | Stable JSON, NDJSON, exit codes, and MCP output semantics |
| Troubleshooting | Common errors and fixes, exit codes |
| FAQ | Frequently asked questions |
| Archived Korean Docs | Historical Korean translations; English root docs are canonical |
pnpm install && pnpm build
pnpm test # run the Vitest suite
pnpm test:coverage # with coverage report
pnpm typecheck
pnpm dev review path/to/diff.patch

Golden-bug fixtures under benchmarks/golden-bugs/ drive the false-negative and FP-regression framework (see #472). The current deterministic offline gate covers 20 fixtures: 14 recall cases and 6 FP-regression cases.
Required offline gate (fast, no API calls):
pnpm bench:ci # schema + reference gate for CI
pnpm bench:fn -- --validate-only # schema-check fixtures
pnpm bench:reference -- --validate-only   # validate the 20-fixture reference gate

The required gate is provider-free and protects against schema/reference regressions. Live benchmark runs are separate manual evidence artifacts for quality claims, and bench-out* result directories stay uncommitted; CI/workflows should upload artifacts instead. The latest stable-candidate live benchmark capture is docs/live-benchmark-report.md.
Score pre-computed results (fast, no API calls):
pnpm bench:fn -- --results path/to/results-dir # score against pre-computed review output
pnpm bench:fn -- --results path/to/results-dir --json # CI-friendly JSON report
pnpm bench:fn:compare -- --baseline old-results --candidate new-results

Run the live pipeline against every fixture (produces the results dir above):
export OPENROUTER_API_KEY=...
pnpm bench:fn:run -- --results ./bench-out
pnpm bench:fn -- --results ./bench-out

The driver uses benchmarks/.ca/config.json by default. Dedicated run configs live under benchmarks/.ca/, including config.free-smoke.json for a one-fixture free-model gate and config.low-cost-diverse.json for the current low-cost diverse benchmark. Add --fixtures id1,id2 to restrict the run to specific fixtures, or --skip-head to skip the L3 verdict stage.
Two fixture kinds live side by side:
- Recall cases (expectedFindings non-empty) — review must surface each listed bug. Misses count as FN.
- FP regression cases (expectedFindings is []) — review must report nothing. Any finding is a regression.
Current reference fixtures: 14 recall cases + 6 FP regression cases. See benchmarks/golden-bugs/README.md for fixture format and reference-gate semantics.
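For orientation, a guessed fixture shape is sketched below. Only the expectedFindings semantics come from this section; every other field name is hypothetical, and benchmarks/golden-bugs/README.md remains the canonical format reference.

```ts
// Guessed fixture shape for illustration only -- the fixtures README is canonical.
// Only the expectedFindings semantics are taken from this document: a non-empty
// array marks a recall case, an empty array marks an FP-regression case.
interface GoldenBugFixture {
  id: string;                 // e.g. "authz-admin-bypass", "auth-session-dual"
  diff: string;               // the patch shown to reviewers (hypothetical field)
  expectedFindings: Array<{
    file: string;
    line: number;
    description: string;
  }>;
}

function fixtureKind(fixture: GoldenBugFixture): "recall" | "fp-regression" {
  // Misses on a recall case count as FN; any finding on an FP-regression case is a regression.
  return fixture.expectedFindings.length > 0 ? "recall" : "fp-regression";
}
```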
Full report: docs/golden-bug-benchmark-report-2026-04-27.md.
Smoke gate:
pnpm bench:fn:run -- --results ./bench-out-smoke \
--config benchmarks/.ca/config.free-smoke.json \
--fixtures authz-admin-bypass \
--skip-head
pnpm bench:fn -- --results ./bench-out-smoke

The smoke run executed only authz-admin-bypass and passed that fixture (1/1, fp=0). The full-suite aggregate for bench-out-smoke is intentionally not meaningful because the other fixtures were not run.
Full low-cost diverse run:
pnpm bench:fn:run -- --results ./bench-out-low-cost-confirmed-20260427 \
--config benchmarks/.ca/config.low-cost-diverse.json \
--skip-head
pnpm bench:fn -- --results ./bench-out-low-cost-confirmed-20260427

The 2026-04-28 follow-up added auth-session-dual as a non-quota same-file multi-bug recall fixture, then reran that fixture into the same results directory:
pnpm bench:fn:run -- --results ./bench-out-low-cost-confirmed-20260427 \
--config benchmarks/.ca/config.low-cost-diverse.json \
--fixtures auth-session-dual \
--skip-head
pnpm bench:fn -- --results ./bench-out-low-cost-confirmed-20260427

| Metric | Result |
|---|---|
| Total fixtures | 12 |
| Recall / FP-regression fixtures | 8 / 4 |
| Expected findings | 10 |
| Actual findings | 32 |
| TP / FP / FN | 10 / 0 / 0 |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| FP clean-rate | 100.0% |
| mean recall@3 / @5 / @10 | 100.0% / 100.0% / 100.0% |
| FP regressions triggered | 0/4 |
Per-fixture result: every recall fixture passed with fp=0 and r@3=100.0%; every FP regression fixture passed. quota-manager-dual and auth-session-dual both score 2/2, fp=0, and r@3=100.0% in the confirmed aggregate.
bench:fn:run also writes per-fixture runtime metadata under <results>/_meta/. For the targeted auth-session-dual run, _meta/auth-session-dual.json recorded 4 backend calls, 31,504ms total backend latency, 32,636ms wall time, and cost N/A because no token usage was returned for those calls.
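Reconstructed from the fields described above, the per-fixture meta file plausibly looks like the following TypeScript shape; the actual property names may differ.

```ts
// Assumed shape of <results>/_meta/<fixture>.json, reconstructed from the fields
// described above; the real property names may differ.
interface FixtureRunMeta {
  fixtureId: string;        // e.g. "auth-session-dual"
  backendCalls: number;     // 4 in the targeted run above
  backendLatencyMs: number; // summed model-call latency (31,504 ms above)
  wallTimeMs: number;       // end-to-end duration (32,636 ms above)
  costUsd: number | null;   // null/"N/A" when providers return no token usage
}
```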
Session baseline before tuning was TP=5 FP=20 FN=3, precision 20.0%, recall 62.5%, F1 30.3%, and FP clean-rate 50.0% on the low-cost diverse run.
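As a sanity check, those baseline numbers follow directly from the standard precision/recall/F1 definitions applied to TP=5, FP=20, FN=3:

```ts
// Reproduces the pre-tuning baseline above from TP=5, FP=20, FN=3 using the
// standard definitions; this mirrors what the scorer reports, not its source code.
function score(tp: number, fp: number, fn: number) {
  const precision = tp / (tp + fp);                            // 5 / 25 = 0.200
  const recall = tp / (tp + fn);                               // 5 / 8  = 0.625
  const f1 = (2 * precision * recall) / (precision + recall);  // ≈ 0.303
  return { precision, recall, f1 };
}

console.log(score(5, 20, 3));
// -> { precision: 0.2, recall: 0.625, f1: 0.303... }
```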
Three live runs with the default 3-reviewer OpenRouter config (#24666562754, #24667305646, #24667897271):
| Metric | Mean | Min | Max |
|---|---|---|---|
| recall@3 | 100.0% | 100.0% | 100.0% |
| recall@5 | 100.0% | 100.0% | 100.0% |
| recall@10 | 100.0% | 100.0% | 100.0% |
| FPs per fp-regression fixture | 2.3 | 2 | 3 |
| fp-regression triggered | 3/3 runs | | |
Recall stable — all three recall cases (off-by-one, null-deref, SQL injection) caught in top-3 on every run.
FP regression triggered on every run — but the content of the phantom findings shifts between runs: CRITICAL×3 about unhandled JSON.parse on run 1, WARNING×2 about regex DoS + input size on run 2, WARNING + CRITICAL about unbounded string + missing type import on run 3. Each individual claim is a plausible-sounding, code-level assertion that the review would make against a real diff, which is exactly why the current calibration stack does not filter them. This confirms the "high-confidence corroborated FP" blind spot documented in project_calibration_stack.md. This fixture is the regression gate for future calibration work (see #468).
MIT