| 1 | +# Claude Code Benchmark Results |
| 2 | + |
| 3 | +This page records the Claude Code compatibility-backend runs available in the |
| 4 | +local experiment artifacts. Bayesian-Agent handled benchmark orchestration, |
| 5 | +grading, and result writing; Claude Code executed each task prompt through the |
| 6 | +adapter boundary. |
| 7 | + |
| 8 | +Evidence type: newly run / verified local artifact. Metrics were read from |
| 9 | +`results/claude_code/**/baseline/results.json` and |
| 10 | +`results/claude_code_smoke/**/baseline/results.json`. |
| 11 | + |
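The two glob patterns above can be walked with the standard library. The sketch below only locates and parses the files; it assumes nothing about the keys inside `results.json`, since the page does not describe that schema. The helper names `find_result_files` and `load_results` are illustrative and not part of the repository.

```python
import json
from pathlib import Path


def find_result_files(root: Path) -> list[Path]:
    """Return every baseline results.json below root, sorted for stable output."""
    return sorted(root.glob("**/baseline/results.json"))


def load_results(root: Path) -> dict[str, object]:
    """Map each results.json path (relative to root) to its parsed JSON content."""
    return {
        path.relative_to(root).as_posix(): json.loads(path.read_text(encoding="utf-8"))
        for path in find_result_files(root)
    }


if __name__ == "__main__":
    # The two roots named on this page; a missing root simply yields no files.
    for root in (Path("results/claude_code"), Path("results/claude_code_smoke")):
        for name in load_results(root):
            print(f"{root}/{name}")
```

Keeping the keys relative to the root makes it easy to compare the full and smoke trees side by side.
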
| 12 | +## Commands |
| 13 | + |
| 14 | +Full baseline runs used this harness shape: |
| 15 | + |
| 16 | +```bash |
| 17 | +python experiments/run_benchmarks.py \ |
| 18 | + --harness claude-code \ |
| 19 | + --model "$MODEL" \ |
| 20 | + --bench "$BENCH" \ |
| 21 | + --mode baseline \ |
| 22 | + --out-root "results/claude_code/${MODEL_SLUG}/${BENCH}" |
| 23 | +``` |
| 24 | + |
| 25 | +Smoke runs used the same backend with `--limit 1` and smoke-specific output |
| 26 | +roots under `results/claude_code_smoke/`. |
| 27 | + |
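For scripting the same invocations, the command shape above could be assembled as an argument list. This is a minimal sketch under stated assumptions: `build_command`, its `model_slug` argument, and the example values are hypothetical, the page does not say how `MODEL_SLUG` is derived or which identifiers `--bench` accepts, and smoke runs are assumed to mirror the full-run directory layout under `results/claude_code_smoke/`.

```python
import shlex


def build_command(model: str, model_slug: str, bench: str, smoke: bool = False) -> list[str]:
    """Assemble the argv for one baseline run using only the flags shown on this page."""
    # Assumption: smoke runs reuse the <slug>/<bench> layout under their own root.
    root = "results/claude_code_smoke" if smoke else "results/claude_code"
    argv = [
        "python", "experiments/run_benchmarks.py",
        "--harness", "claude-code",
        "--model", model,
        "--bench", bench,
        "--mode", "baseline",
        "--out-root", f"{root}/{model_slug}/{bench}",
    ]
    if smoke:
        argv += ["--limit", "1"]
    return argv


if __name__ == "__main__":
    # Placeholder values; substitute the real model, slug, and benchmark identifiers.
    print(shlex.join(build_command("example-model", "example-model", "example-bench", smoke=True)))
```

Passing an argument list (for example to `subprocess.run`) avoids shell quoting problems with model names that contain characters such as `[` and `]`; `shlex.join` is used here only to print a copy-pasteable line.
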
| 28 | +## Full Baseline Results |
| 29 | + |
| 30 | +| Model | Benchmark | Accuracy | Total Tokens | Efficiency (passes per 1M tokens) | Failed Tasks | |
| 31 | +|---|---|---:|---:|---:|---| |
| 32 | +| deepseek-v4-flash | SOP-Bench | 18/20 (90%) | 5,887,709 | 3.06 | `sop_15`, `sop_16` | |
| 33 | +| deepseek-v4-flash | Lifelong AgentBench | 20/20 (100%) | 1,552,970 | 12.88 | none | |
| 34 | +| deepseek-v4-flash | RealFin-Bench | 31/40 (77.5%) | 49,414,353 | 0.63 | 9 tasks | |
| 35 | +| deepseek-v4-pro[1m] | SOP-Bench | 13/20 (65%) | 2,759,895 | 4.71 | 7 tasks | |
| 36 | +| deepseek-v4-pro[1m] | Lifelong AgentBench | 20/20 (100%) | 1,552,632 | 12.88 | none | |
| 37 | +| deepseek-v4-pro[1m] | RealFin-Bench | 26/40 (65%) | 27,034,780 | 0.96 | 14 tasks | |
| 38 | + |
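The page does not state how the Efficiency column is computed. Every row in the table above is consistent with passed tasks per one million total tokens, rounded to two decimals, so the sketch below uses that reading as an assumption. The function name `efficiency` is illustrative.

```python
def efficiency(passed: int, total_tokens: int) -> float:
    """Passed tasks per one million total tokens, rounded to two decimals.

    Assumption: this formula is inferred from the table, not taken from the runner.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return round(passed / (total_tokens / 1_000_000), 2)


# (model, benchmark, passed, total tokens, efficiency as reported in the table)
FULL_BASELINE_ROWS = [
    ("deepseek-v4-flash", "SOP-Bench", 18, 5_887_709, 3.06),
    ("deepseek-v4-flash", "Lifelong AgentBench", 20, 1_552_970, 12.88),
    ("deepseek-v4-flash", "RealFin-Bench", 31, 49_414_353, 0.63),
    ("deepseek-v4-pro[1m]", "SOP-Bench", 13, 2_759_895, 4.71),
    ("deepseek-v4-pro[1m]", "Lifelong AgentBench", 20, 1_552_632, 12.88),
    ("deepseek-v4-pro[1m]", "RealFin-Bench", 26, 27_034_780, 0.96),
]

if __name__ == "__main__":
    for model, bench, passed, tokens, reported in FULL_BASELINE_ROWS:
        computed = efficiency(passed, tokens)
        status = "matches" if computed == reported else "differs from"
        print(f"{model} / {bench}: computed {computed:.2f} {status} reported {reported:.2f}")
```

Under this reading, a run that passes no tasks scores 0.00 regardless of token count.
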
| 39 | +## Smoke Results |
| 40 | + |
| 41 | +| Model | Benchmark | Accuracy | Total Tokens | Efficiency (passes per 1M tokens) | Failed Tasks | |
| 42 | +|---|---|---:|---:|---:|---| |
| 43 | +| deepseek-v4-flash | SOP-Bench | 1/1 (100%) | 276,483 | 3.62 | none | |
| 44 | +| deepseek-v4-flash | Lifelong AgentBench | 1/1 (100%) | 75,551 | 13.24 | none | |
| 45 | +| deepseek-v4-flash | RealFin-Bench | 0/1 (0%) | 686,454 | 0.00 | `task_01_macd_rsi_filter` | |
| 46 | +| deepseek-v4-pro[1m] | SOP-Bench | 1/1 (100%) | 149,332 | 6.70 | none | |
| 47 | +| deepseek-v4-pro[1m] | Lifelong AgentBench | 1/1 (100%) | 75,565 | 13.23 | none | |
| 48 | +| deepseek-v4-pro[1m] | RealFin-Bench | 0/1 (0%) | 1,040,695 | 0.00 | `task_01_macd_rsi_filter` | |
| 49 | + |
| 50 | +## RealFin Failures |
| 51 | + |
| 52 | +### deepseek-v4-flash |
| 53 | + |
| 54 | +- `task_01_macd_rsi_filter` |
| 55 | +- `task_11_semiconductor_macd` |
| 56 | +- `task_15_volume_price_contraction_breakout` |
| 57 | +- `task_16_pe_bollinger_reversal` |
| 58 | +- `task_21_morning_star` |
| 59 | +- `task_22_bullish_sandwich` |
| 60 | +- `task_27_atr_volatility_breakout` |
| 61 | +- `task_31_triple_timeframe_macd` |
| 62 | +- `task_32_weekly_breakout_daily_pullback` |
| 63 | + |
| 64 | +### deepseek-v4-pro[1m] |
| 65 | + |
| 66 | +- `task_11_semiconductor_macd` |
| 67 | +- `task_20_ultimate_multi_condition` |
| 68 | +- `task_21_morning_star` |
| 69 | +- `task_22_bullish_sandwich` |
| 70 | +- `task_23_gentle_volume_rise` |
| 71 | +- `task_24_doji_to_surge` |
| 72 | +- `task_26_ma_convergence_divergence` |
| 73 | +- `task_27_atr_volatility_breakout` |
| 74 | +- `task_28_volatility_compression_explosion` |
| 75 | +- `task_31_triple_timeframe_macd` |
| 76 | +- `task_32_weekly_breakout_daily_pullback` |
| 77 | +- `task_33_sector_leadership` |
| 78 | +- `task_39_interest_rate_sector_rotation` |
| 79 | +- `task_40_commodity_equity_linkage` |
| 80 | + |
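To see how much the two failure lists overlap, the task IDs above can be compared as sets. The sketch uses only the IDs listed on this page; the variable names are illustrative.

```python
# RealFin-Bench failures per model, copied from the lists above.
FLASH_FAILURES = {
    "task_01_macd_rsi_filter",
    "task_11_semiconductor_macd",
    "task_15_volume_price_contraction_breakout",
    "task_16_pe_bollinger_reversal",
    "task_21_morning_star",
    "task_22_bullish_sandwich",
    "task_27_atr_volatility_breakout",
    "task_31_triple_timeframe_macd",
    "task_32_weekly_breakout_daily_pullback",
}
PRO_FAILURES = {
    "task_11_semiconductor_macd",
    "task_20_ultimate_multi_condition",
    "task_21_morning_star",
    "task_22_bullish_sandwich",
    "task_23_gentle_volume_rise",
    "task_24_doji_to_surge",
    "task_26_ma_convergence_divergence",
    "task_27_atr_volatility_breakout",
    "task_28_volatility_compression_explosion",
    "task_31_triple_timeframe_macd",
    "task_32_weekly_breakout_daily_pullback",
    "task_33_sector_leadership",
    "task_39_interest_rate_sector_rotation",
    "task_40_commodity_equity_linkage",
}

shared = sorted(FLASH_FAILURES & PRO_FAILURES)      # failed by both models
flash_only = sorted(FLASH_FAILURES - PRO_FAILURES)  # failed only by deepseek-v4-flash
pro_only = sorted(PRO_FAILURES - FLASH_FAILURES)    # failed only by deepseek-v4-pro[1m]

if __name__ == "__main__":
    for label, tasks in (("both", shared), ("flash only", flash_only), ("pro only", pro_only)):
        print(f"{label} ({len(tasks)}): {', '.join(tasks)}")
```

Tasks in the shared set are candidates for benchmark-side or adapter-side investigation, since both models failed them under the same harness.
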
| 81 | +## Artifacts |
| 82 | + |
| 83 | +The committed result artifacts intentionally include run summaries and |
| 84 | +`results.json` files only. Large per-task workspaces, transcripts, and tool logs |
| 85 | +remain under the ignored `results/` tree unless explicitly force-added later. |
| 86 | + |
| 87 | +- Full runs: `results/claude_code/` |
| 88 | +- Smoke runs: `results/claude_code_smoke/` |
| 89 | +- Adapter: `bayesian_agent/adapters/claude_code.py` |
| 90 | +- Runner: `experiments/run_benchmarks.py --harness claude-code` |