Commit db0973e

data: OPE-99 benchmark -- 3-run results (publication quality)
Results from 30 Claude API calls (5 tasks x 3 runs x 2 conditions):

Without AGENTS.md: 27/33 checks passed (81.8%)
With AGENTS.md: 33/33 checks passed (100.0%)
Improvement: 100% fewer violations (consistent across all 3 runs)

Two violations found (both deterministic, 3/3 times):

1. install_package: Claude used 'npm install date-fns' without context.
   With AGENTS.md ('Package manager: bun -- always use bun install, never npm'):
   Claude used 'bun add date-fns'. Package manager is codebase-specific --
   Claude cannot know this from training data.

2. frontend_component: Claude used className template literals without context.
   With AGENTS.md ('Class merging: use cn() from @/lib/utils'):
   Claude used cn() correctly. This is a shadcn/ui convention, not universal.

Tasks that passed in both conditions (Claude already knew from training):
- FastAPI Depends() auth pattern (well-known FastAPI convention)
- snake_case Python functions (general Python knowledge)
- HTTPException for errors (documented FastAPI pattern)
- pytest fixtures (standard pytest knowledge)

Conclusion: AGENTS.md matters most for PROJECT-SPECIFIC conventions that
Claude can't know from training data alone. Generic patterns (FastAPI Depends,
snake_case) Claude already knows. Specific tools (bun, cn()) and local
conventions require context.
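The headline numbers in this message follow from the per-check pass rates recorded in the Tasks field of benchmark_report.md. The sketch below re-derives them as a sanity check. The pass rates are transcribed from that field; the names `PASS_RATES`, `summarize`, `RUNS_PER_TASK`, and `CONDITIONS` are hypothetical and do not come from the benchmark's own code, and it assumes every check was evaluated once per run.

```python
# Per-check pass rates as (without AGENTS.md, with AGENTS.md),
# transcribed from the Tasks field of benchmark_report.md.
PASS_RATES = {
    "backend_endpoint": {
        "uses_httpexception": (1.0, 1.0),
        "snake_case_functions": (1.0, 1.0),
        "depends_auth": (1.0, 1.0),
        "no_print": (1.0, 1.0),
    },
    "frontend_component": {
        "uses_usequery": (1.0, 1.0),
        "uses_cn": (0.0, 1.0),
    },
    "install_package": {
        "uses_bun_not_npm": (0.0, 1.0),
    },
    "write_test": {
        "uses_pytest_fixtures": (1.0, 1.0),
        "snake_case_functions": (1.0, 1.0),
    },
    "auth_endpoint": {
        "depends_auth": (1.0, 1.0),
        "uses_httpexception": (1.0, 1.0),
    },
}

RUNS_PER_TASK = 3  # from the report header
CONDITIONS = 2     # without / with AGENTS.md


def summarize(pass_rates, runs):
    """Recompute call counts, check totals, and violation averages."""
    tasks = len(pass_rates)
    checks_per_run = sum(len(checks) for checks in pass_rates.values())
    total_checks = checks_per_run * runs
    summary = {
        "api_calls": tasks * runs * CONDITIONS,
        "total_checks": total_checks,
    }
    for idx, label in enumerate(("without", "with")):
        # A pass rate of 1.0 over `runs` runs contributes `runs` passed checks.
        passed = round(
            sum(rates[idx] for checks in pass_rates.values() for rates in checks.values())
            * runs
        )
        summary[label] = {
            "passed": passed,
            "pass_pct": round(100 * passed / total_checks, 1),
            "avg_violations_per_task": round((total_checks - passed) / (tasks * runs), 1),
        }
    return summary


summary = summarize(PASS_RATES, RUNS_PER_TASK)
print(summary)
```

Under these assumptions the totals work out to 11 checks per run, 33 per condition, and 6 failed checks without AGENTS.md spread over 15 task runs, which matches the 0.4 average in the summary table.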
1 parent 88aa268 commit db0973e

2 files changed

Lines changed: 273 additions & 29 deletions

benchmark/benchmark_report.md

Lines changed: 5 additions & 5 deletions
@@ -2,16 +2,16 @@
 
 **Repo:** `repo1-fastapi-template` (tiangolo/full-stack-fastapi-template)
 **Model:** claude-sonnet-4-20250514
-**Tasks:** {'backend_endpoint': {'title': 'Add a FastAPI endpoint to list user items', 'avg_violations_without': 0.0, 'avg_violations_with': 0.0, 'improvement': 0.0, 'checks': {'uses_httpexception': {'description': 'Errors use HTTPException', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'snake_case_functions': {'description': 'Python functions use snake_case', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'depends_auth': {'description': 'Auth uses Depends() not manual headers', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'no_print': {'description': 'Uses logging not print()', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}}}, 'frontend_component': {'title': 'Add a React component for a user profile card', 'avg_violations_without': 1.0, 'avg_violations_with': 0.0, 'improvement': 1.0, 'checks': {'uses_usequery': {'description': 'Data fetching uses useQuery not raw fetch', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'uses_cn': {'description': 'Conditional classes use cn()', 'pass_rate_without_context': 0.0, 'pass_rate_with_context': 1.0, 'improvement': 1.0}}}, 'install_package': {'title': 'Install the date-fns package', 'avg_violations_without': 1.0, 'avg_violations_with': 0.0, 'improvement': 1.0, 'checks': {'uses_bun_not_npm': {'description': 'Install uses bun not npm/yarn', 'pass_rate_without_context': 0.0, 'pass_rate_with_context': 1.0, 'improvement': 1.0}}}, 'write_test': {'title': 'Write a pytest test for the items endpoint', 'avg_violations_without': 0.0, 'avg_violations_with': 0.0, 'improvement': 0.0, 'checks': {'uses_pytest_fixtures': {'description': 'Tests use pytest fixtures not setUp/tearDown', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'snake_case_functions': {'description': 'Test functions use snake_case', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}}}, 'auth_endpoint': {'title': 'Add auth to an existing endpoint', 'avg_violations_without': 0.0, 'avg_violations_with': 0.0, 'improvement': 0.0, 'checks': {'depends_auth': {'description': 'Auth uses Depends() not manual headers', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'uses_httpexception': {'description': 'Errors use HTTPException', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}}}} | **Runs per task:** 1
-**Total checks:** 11 per condition
+**Tasks:** {'backend_endpoint': {'title': 'Add a FastAPI endpoint to list user items', 'avg_violations_without': 0.0, 'avg_violations_with': 0.0, 'improvement': 0.0, 'checks': {'uses_httpexception': {'description': 'Errors use HTTPException', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'snake_case_functions': {'description': 'Python functions use snake_case', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'depends_auth': {'description': 'Auth uses Depends() not manual headers', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'no_print': {'description': 'Uses logging not print()', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}}}, 'frontend_component': {'title': 'Add a React component for a user profile card', 'avg_violations_without': 1.0, 'avg_violations_with': 0.0, 'improvement': 1.0, 'checks': {'uses_usequery': {'description': 'Data fetching uses useQuery not raw fetch', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'uses_cn': {'description': 'Conditional classes use cn()', 'pass_rate_without_context': 0.0, 'pass_rate_with_context': 1.0, 'improvement': 1.0}}}, 'install_package': {'title': 'Install the date-fns package', 'avg_violations_without': 1.0, 'avg_violations_with': 0.0, 'improvement': 1.0, 'checks': {'uses_bun_not_npm': {'description': 'Install uses bun not npm/yarn', 'pass_rate_without_context': 0.0, 'pass_rate_with_context': 1.0, 'improvement': 1.0}}}, 'write_test': {'title': 'Write a pytest test for the items endpoint', 'avg_violations_without': 0.0, 'avg_violations_with': 0.0, 'improvement': 0.0, 'checks': {'uses_pytest_fixtures': {'description': 'Tests use pytest fixtures not setUp/tearDown', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'snake_case_functions': {'description': 'Test functions use snake_case', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}}}, 'auth_endpoint': {'title': 'Add auth to an existing endpoint', 'avg_violations_without': 0.0, 'avg_violations_with': 0.0, 'improvement': 0.0, 'checks': {'depends_auth': {'description': 'Auth uses Depends() not manual headers', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}, 'uses_httpexception': {'description': 'Errors use HTTPException', 'pass_rate_without_context': 1.0, 'pass_rate_with_context': 1.0, 'improvement': 0.0}}}} | **Runs per task:** 3
+**Total checks:** 33 per condition
 
 ## Summary
 
 | Condition | Checks passed | Avg violations/task |
 |---|---|---|
-| Without AGENTS.md | 9/11 (81.8%) | 0.4 |
-| With AGENTS.md | 11/11 (100.0%) | 0.0 |
-| **Improvement** | **+2 checks** | **100.0% fewer violations** |
+| Without AGENTS.md | 27/33 (81.8%) | 0.4 |
+| With AGENTS.md | 33/33 (100.0%) | 0.0 |
+| **Improvement** | **+6 checks** | **100.0% fewer violations** |
 
 ## Per-task breakdown
 
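The two checks that changed between conditions, `uses_bun_not_npm` and `uses_cn`, are not shown in this diff. As a rough illustration of what pattern-based checks of this kind might look like, here is a minimal sketch. The regular expressions and function bodies are assumptions made for illustration only; the benchmark's actual checker may work differently.

```python
import re

# Hypothetical patterns; the real benchmark's matching rules are not shown in the commit.
NPM_OR_YARN = re.compile(r"\b(?:npm\s+(?:install|i|add)|yarn\s+add)\b")
BUN = re.compile(r"\bbun\s+(?:add|install)\b")
TEMPLATE_CLASSNAME = re.compile(r"className=\{`[^`]*\$\{")
CN_CALL = re.compile(r"\bcn\(")


def uses_bun_not_npm(response: str) -> bool:
    """Pass when the response installs with bun and never invokes npm or yarn."""
    return bool(BUN.search(response)) and not NPM_OR_YARN.search(response)


def uses_cn(response: str) -> bool:
    """Pass when classes are merged with cn() rather than a className template literal."""
    return bool(CN_CALL.search(response)) and not TEMPLATE_CLASSNAME.search(response)


# The two install commands quoted in the commit message.
print(uses_bun_not_npm("bun add date-fns"))
print(uses_bun_not_npm("npm install date-fns"))
```

A text-matching check like this is cheap to run over every response, but it would also flag a response that merely mentions `npm install` in an explanation, so a real checker would likely need to restrict matching to code blocks or commands.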