Commit 73df23c

Document Claude Code benchmark results
1 parent 23dfe7e commit 73df23c

1 file changed

Lines changed: 90 additions & 0 deletions

docs/experiments/claude-code.md

# Claude Code Benchmark Results

This page records the Claude Code compatibility-backend runs available in the
local experiment artifacts. Bayesian-Agent handled benchmark orchestration,
grading, and result writing; Claude Code executed each task prompt through the
adapter boundary.

Evidence type: newly run / verified local artifact. Metrics were read from
`results/claude_code/**/baseline/results.json` and
`results/claude_code_smoke/**/baseline/results.json`.
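
The two glob patterns above can be walked with the standard library. The following is a minimal sketch, not the project's own tooling: `load_results` is a hypothetical helper, and because the schema of `results.json` is not documented on this page, the sketch only parses each file and makes no assumption about its keys.

```python
import json
from pathlib import Path

# Glob patterns quoted from this page, relative to the repository root.
RESULT_GLOBS = (
    "results/claude_code/**/baseline/results.json",
    "results/claude_code_smoke/**/baseline/results.json",
)


def load_results(repo_root="."):
    """Return {relative path: parsed JSON} for every matching results file.

    Hypothetical helper: the JSON is returned as-is because its schema
    is not described in this document.
    """
    root = Path(repo_root)
    loaded = {}
    for pattern in RESULT_GLOBS:
        for path in sorted(root.glob(pattern)):
            with path.open(encoding="utf-8") as handle:
                loaded[path.relative_to(root).as_posix()] = json.load(handle)
    return loaded


if __name__ == "__main__":
    # Print the artifact paths found under the current directory.
    for name in load_results():
        print(name)
```

Run from the repository root, this lists every baseline `results.json` that fed the tables on this page, provided the artifacts are present locally.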

## Commands

Full baseline runs used this harness shape:

```bash
python experiments/run_benchmarks.py \
  --harness claude-code \
  --model "$MODEL" \
  --bench "$BENCH" \
  --mode baseline \
  --out-root "results/claude_code/${MODEL_SLUG}/${BENCH}"
```

Smoke runs used the same backend with `--limit 1` and smoke-specific output
roots under `results/claude_code_smoke/`.
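
To make the difference between the full and smoke invocations concrete, here is a small sketch that assembles the argument list for either kind of run. The flags are the ones shown on this page; `build_command` is a hypothetical helper, and the smoke output layout (`results/claude_code_smoke/<slug>/<bench>`) is an assumption that mirrors the full-run layout, since this page only names the smoke root.

```python
import shlex

RUNNER = "experiments/run_benchmarks.py"


def build_command(model, model_slug, bench, smoke=False):
    """Return the argv list for one baseline run (full or smoke)."""
    # Assumption: smoke runs mirror the full-run directory layout
    # under their own root; only the root itself is documented.
    root = "results/claude_code_smoke" if smoke else "results/claude_code"
    argv = [
        "python", RUNNER,
        "--harness", "claude-code",
        "--model", model,
        "--bench", bench,
        "--mode", "baseline",
        "--out-root", f"{root}/{model_slug}/{bench}",
    ]
    if smoke:
        # Smoke runs restrict the benchmark to a single task.
        argv += ["--limit", "1"]
    return argv


if __name__ == "__main__":
    # "example-model" and "example-bench" are placeholders, not real identifiers.
    command = build_command("example-model", "example-model", "example-bench", smoke=True)
    print(shlex.join(command))
```

Building the command as a list rather than a string avoids quoting problems with model names that contain shell metacharacters, such as the brackets in `deepseek-v4-pro[1m]`.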
27+
28+
## Full Baseline Results
29+
30+
| Model | Benchmark | Accuracy | Total Tokens | Efficiency | Failed Tasks |
31+
|---|---:|---:|---:|---:|---|
32+
| deepseek-v4-flash | SOP-Bench | 18/20 (90%) | 5,887,709 | 3.06 | `sop_15`, `sop_16` |
33+
| deepseek-v4-flash | Lifelong AgentBench | 20/20 (100%) | 1,552,970 | 12.88 | none |
34+
| deepseek-v4-flash | RealFin-Bench | 31/40 (77.5%) | 49,414,353 | 0.63 | 9 tasks |
35+
| deepseek-v4-pro[1m] | SOP-Bench | 13/20 (65%) | 2,759,895 | 4.71 | 7 tasks |
36+
| deepseek-v4-pro[1m] | Lifelong AgentBench | 20/20 (100%) | 1,552,632 | 12.88 | none |
37+
| deepseek-v4-pro[1m] | RealFin-Bench | 26/40 (65%) | 27,034,780 | 0.96 | 14 tasks |
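
This page does not define the Efficiency column. The reported values are, however, consistent with passed tasks per million total tokens. The sketch below checks that reading against the full-run rows above; the formula is an inference from the table, not a definition taken from the harness, and the helper names are hypothetical.

```python
# (model, benchmark, passed tasks, total tokens, reported efficiency),
# copied from the full baseline table.
FULL_RUNS = [
    ("deepseek-v4-flash", "SOP-Bench", 18, 5_887_709, 3.06),
    ("deepseek-v4-flash", "Lifelong AgentBench", 20, 1_552_970, 12.88),
    ("deepseek-v4-flash", "RealFin-Bench", 31, 49_414_353, 0.63),
    ("deepseek-v4-pro[1m]", "SOP-Bench", 13, 2_759_895, 4.71),
    ("deepseek-v4-pro[1m]", "Lifelong AgentBench", 20, 1_552_632, 12.88),
    ("deepseek-v4-pro[1m]", "RealFin-Bench", 26, 27_034_780, 0.96),
]


def passes_per_million_tokens(passed, total_tokens):
    """Assumed efficiency formula: passed tasks per 1,000,000 tokens."""
    return passed / (total_tokens / 1_000_000)


def check_rows(rows, tolerance=0.005):
    """Return the rows whose reported value disagrees with the assumed formula."""
    mismatches = []
    for model, bench, passed, tokens, reported in rows:
        computed = passes_per_million_tokens(passed, tokens)
        # The table reports two decimals, so allow half a unit in the last place.
        if abs(computed - reported) > tolerance:
            mismatches.append((model, bench, computed, reported))
    return mismatches


if __name__ == "__main__":
    for model, bench, passed, tokens, reported in FULL_RUNS:
        computed = passes_per_million_tokens(passed, tokens)
        print(f"{model:22} {bench:20} computed={computed:.2f} reported={reported:.2f}")
```

Under this reading, RealFin-Bench has the lowest efficiency for both models because its token totals are an order of magnitude larger than those of the other two benchmarks.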

## Smoke Results

| Model | Benchmark | Accuracy | Total Tokens | Efficiency | Failed Tasks |
|---|---|---:|---:|---:|---|
| deepseek-v4-flash | SOP-Bench | 1/1 (100%) | 276,483 | 3.62 | none |
| deepseek-v4-flash | Lifelong AgentBench | 1/1 (100%) | 75,551 | 13.24 | none |
| deepseek-v4-flash | RealFin-Bench | 0/1 (0%) | 686,454 | 0.00 | `task_01_macd_rsi_filter` |
| deepseek-v4-pro[1m] | SOP-Bench | 1/1 (100%) | 149,332 | 6.70 | none |
| deepseek-v4-pro[1m] | Lifelong AgentBench | 1/1 (100%) | 75,565 | 13.23 | none |
| deepseek-v4-pro[1m] | RealFin-Bench | 0/1 (0%) | 1,040,695 | 0.00 | `task_01_macd_rsi_filter` |

## RealFin Failures

### deepseek-v4-flash

- `task_01_macd_rsi_filter`
- `task_11_semiconductor_macd`
- `task_15_volume_price_contraction_breakout`
- `task_16_pe_bollinger_reversal`
- `task_21_morning_star`
- `task_22_bullish_sandwich`
- `task_27_atr_volatility_breakout`
- `task_31_triple_timeframe_macd`
- `task_32_weekly_breakout_daily_pullback`

### deepseek-v4-pro[1m]

- `task_11_semiconductor_macd`
- `task_20_ultimate_multi_condition`
- `task_21_morning_star`
- `task_22_bullish_sandwich`
- `task_23_gentle_volume_rise`
- `task_24_doji_to_surge`
- `task_26_ma_convergence_divergence`
- `task_27_atr_volatility_breakout`
- `task_28_volatility_compression_explosion`
- `task_31_triple_timeframe_macd`
- `task_32_weekly_breakout_daily_pullback`
- `task_33_sector_leadership`
- `task_39_interest_rate_sector_rotation`
- `task_40_commodity_equity_linkage`
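
The two failure lists overlap, and a set comparison shows which RealFin tasks failed for both models and which failed for only one. The task names below are copied from the lists above; the variable names are illustrative only.

```python
# RealFin-Bench failures from the full baseline runs, as listed above.
FLASH_FAILURES = {
    "task_01_macd_rsi_filter",
    "task_11_semiconductor_macd",
    "task_15_volume_price_contraction_breakout",
    "task_16_pe_bollinger_reversal",
    "task_21_morning_star",
    "task_22_bullish_sandwich",
    "task_27_atr_volatility_breakout",
    "task_31_triple_timeframe_macd",
    "task_32_weekly_breakout_daily_pullback",
}
PRO_FAILURES = {
    "task_11_semiconductor_macd",
    "task_20_ultimate_multi_condition",
    "task_21_morning_star",
    "task_22_bullish_sandwich",
    "task_23_gentle_volume_rise",
    "task_24_doji_to_surge",
    "task_26_ma_convergence_divergence",
    "task_27_atr_volatility_breakout",
    "task_28_volatility_compression_explosion",
    "task_31_triple_timeframe_macd",
    "task_32_weekly_breakout_daily_pullback",
    "task_33_sector_leadership",
    "task_39_interest_rate_sector_rotation",
    "task_40_commodity_equity_linkage",
}

# Tasks that failed under both models, and under exactly one of them.
failed_both = sorted(FLASH_FAILURES & PRO_FAILURES)
flash_only = sorted(FLASH_FAILURES - PRO_FAILURES)
pro_only = sorted(PRO_FAILURES - FLASH_FAILURES)

if __name__ == "__main__":
    print("failed for both models:", failed_both)
    print("failed for deepseek-v4-flash only:", flash_only)
    print("failed for deepseek-v4-pro[1m] only:", pro_only)
```

Six tasks failed for both models (`task_11`, `task_21`, `task_22`, `task_27`, `task_31`, `task_32`), three failed only for deepseek-v4-flash, and eight failed only for deepseek-v4-pro[1m].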

## Artifacts

The committed result artifacts intentionally include only run summaries and
`results.json` files. Large per-task workspaces, transcripts, and tool logs
remain uncommitted under the ignored `results/` tree unless explicitly
force-added later.

- Full runs: `results/claude_code/`
- Smoke runs: `results/claude_code_smoke/`
- Adapter: `bayesian_agent/adapters/claude_code.py`
- Runner: `experiments/run_benchmarks.py --harness claude-code`
