ADC-bench — Algorithm Discovery Complexity Benchmark — is a local Python benchmark for evaluating whether AI coding agents can discover, repair, transfer, and certify algorithmic structure, not merely pass visible unit tests.
ADC-bench is built around a simple research question:
Can an AI agent identify the hidden algorithmic structure required to solve a problem under correctness, complexity, transfer, and anti-cheating constraints?
Unlike ordinary coding benchmarks, ADC-bench does not only ask whether code passes tests. It also asks whether the agent understands the algorithmic idea, rejects weaker approaches, explains the invariant, and satisfies the expected complexity.
Most programming benchmarks evaluate whether a model can produce code that passes unit tests.
ADC-bench focuses on a deeper distinction:
Instance solving = solve one concrete input
Code patching = produce code that passes tests
Algorithm discovery = infer the hidden algorithmic structure
Certification = explain why the algorithm works
Transfer = apply the same structure to related variants
ADC-bench is inspired by software engineering benchmarks, but it is not a SWE-bench clone. SWE-style benchmarks usually ask whether an agent can patch real software issues. ADC-bench asks whether an agent can discover the underlying algorithmic structure behind a task.
The central principle is:
Passing tests is not the same as discovering an algorithm.
For each task, an agent receives a problem specification and starter code.
The agent must produce two files:
solution.py
trace.json
The evaluator then checks:
- whether the solution passes hidden tests,
- whether it satisfies complexity requirements,
- whether the trace correctly identifies the algorithm,
- whether the solution transfers across related task families,
- whether it is robust to adversarial traps,
- whether it avoids cheating.
The benchmark therefore evaluates both:
code correctness
and
algorithmic discovery quality
adc-bench/
README.md
LICENSE
requirements.txt
pyproject.toml
adc_bench/
__init__.py
cli.py
evaluator.py
scoring.py
sandbox.py
schemas.py
trace_validation.py
anti_cheat.py
report.py
tasks/
known/
composed/
synthetic/
adversarial/
agents/
agent_v0/
README.md
agent.py
submissions/
.gitkeep
reports/
.gitkeep
examples/
run_baseline.py
Each task lives in:
tasks/<split>/<task_id>/
Each task folder must contain:
problem.md
starter.py
reference_solution.py
public_tests.py
hidden_tests.py
stress_tests.py
expected_trace.json
metadata.json
forbidden_shortcuts.md
Example:
tasks/known/two_sum_hash/
problem.md
starter.py
reference_solution.py
public_tests.py
hidden_tests.py
stress_tests.py
expected_trace.json
metadata.json
forbidden_shortcuts.md
The agent may use:
problem.md
starter.py
metadata.json
public_tests.py
forbidden_shortcuts.md
These files define the task, the starter implementation, public examples, and forbidden shortcuts.
6. Files Hidden from the Agent
The agent must not read or depend on:
hidden_tests.py
stress_tests.py
reference_solution.py
expected_trace.json
These files are used only by the evaluator.
Reading or hard-coding against these files is considered cheating.
For a task with ID:
two_sum_hash
the submission should be placed at:
submissions/two_sum_hash/
and must contain:
solution.py
trace.json
solution.py must define a function named:
def solve(...):
...The function signature must match the task description in problem.md.
Example:
def solve(nums, target):
seen = set()
for x in nums:
if target - x in seen:
return True
seen.add(x)
return Falsetrace.json is a structured technical explanation of the discovered algorithm.
It is not private chain-of-thought. It should be a concise, inspectable explanation of the algorithmic decision.
Required schema:
{
"chosen_algorithm": "...",
"hypotheses": ["...", "..."],
"rejected_algorithms": [
{
"name": "...",
"reason": "..."
}
],
"invariant": "...",
"complexity_time": "...",
"complexity_space": "...",
"edge_cases": ["...", "..."],
"counterexample_for_wrong_approach": "optional string"
}Example:
{
"chosen_algorithm": "hash set lookup",
"hypotheses": [
"try all pairs",
"sort and use two pointers",
"use a hash set of previously seen values"
],
"rejected_algorithms": [
{
"name": "brute force",
"reason": "O(n^2) time violates the expected O(n) complexity"
}
],
"invariant": "After processing index i, the set contains exactly the elements before i.",
"complexity_time": "O(n)",
"complexity_space": "O(n)",
"edge_cases": [
"empty list",
"duplicate values",
"negative numbers"
],
"counterexample_for_wrong_approach": ""
}A high-quality ADC-bench submission should do more than pass tests.
It should identify the intended structure behind the problem.
Examples:
| Task Type | Expected Discovery |
|---|---|
| Two Sum | Hash set lookup |
| Valid Parentheses | Stack invariant |
| Graph Reachability | BFS or DFS traversal |
| Positive Weighted Shortest Path | Dijkstra relaxation |
| Sliding Window Maximum | Monotonic deque |
| XOR-SAT | Gaussian elimination over GF(2) |
| 2-SAT | Implication graph + strongly connected components |
| Time-dependent Reachability | BFS over expanded state (node, time_mod) |
A strong submission should explain:
- why the chosen algorithm works,
- why weaker approaches fail,
- what invariant is maintained,
- what the time and space complexity are,
- which edge cases matter.
ADC-bench uses the following score:
ADCScore =
0.35 * correctness
+ 0.20 * complexity
+ 0.15 * trace_quality
+ 0.15 * transfer
+ 0.10 * robustness
+ 0.05 * anti_cheat
Each component is normalized to the range:
[0, 1]
Correctness measures whether solution.py passes hidden tests.
correctness = hidden_tests_passed / hidden_tests_total
Public tests are useful for debugging, but they are not enough to establish correctness.
Complexity measures whether the implementation satisfies the expected asymptotic behavior.
This is tested using:
stress_tests.py
timeouts
large inputs
scaling-sensitive cases
Example:
If a task expects O(n) but the submission uses O(n^2), then it may pass public tests but fail stress tests.
Trace quality measures whether trace.json correctly identifies the algorithmic structure.
The validator checks:
chosen_algorithm
hypotheses
rejected_algorithms
invariant
complexity_time
complexity_space
edge_cases
counterexample_for_wrong_approach
The current trace validator is heuristic. It is not a formal proof checker.
Transfer measures whether the agent performs well across related tasks in the same family.
Example:
shortest_path_family/
dijkstra_positive_weights
wrong_hint_shortest_path
path_reconstruction_variant
An agent receives a higher transfer score if the discovered strategy generalizes across variants rather than only solving one task by accident.
Robustness measures performance on adversarial or diagonal tasks.
These tasks may contain:
misleading hints
small public tests that allow brute force
hidden complexity traps
ambiguous edge cases
wrong algorithm suggestions
A robust agent should not blindly follow misleading hints.
Example: the problem says "This probably needs BFS", but the graph has positive weights, so the correct method is Dijkstra.
ADC-bench includes a lightweight local anti-cheat scanner.
Forbidden behavior includes:
reading hidden_tests.py
reading stress_tests.py
reading reference_solution.py
reading expected_trace.json
modifying evaluator files
hard-coding test outputs
using eval or exec to bypass restrictions
using subprocess to inspect benchmark files
using filesystem introspection to discover hidden files
Suspicious strings may include:
hidden_tests
stress_tests
reference_solution
expected_trace
os.listdir
__file__
inspect
subprocess
eval(
exec(
Important limitation:
The local sandbox is not secure against malicious code.
The current anti-cheat system is designed to catch obvious violations in a research prototype, not to provide strong security isolation.
ADC-bench v0.3.1 contains four task splits:
tasks/
known/
composed/
synthetic/
adversarial/
Canonical algorithmic tasks.
Examples:
two_sum_hash
valid_parentheses_stack
graph_reachability_bfs
dijkstra_positive_weights
These tasks test whether an agent can recognize and implement standard algorithmic structures.
Tasks requiring composition of known techniques.
Examples:
sliding_window_max_deque
xor_sat_f2
two_sat_scc
These tasks test whether an agent can combine multiple ideas or choose the correct abstraction.
Original or less-standard problems designed to reduce pure memorization.
Examples:
parity_interval_merge
rotating_key_reachability
These tasks test whether an agent can analyze a new problem rather than only recall a known template.
Tasks designed to expose shortcut behavior.
Example:
wrong_hint_shortest_path
These tasks may include misleading hints, hidden complexity traps, or cases where a common algorithmic guess fails.
ADC-bench v0.3.1 includes 10 initial algorithm tasks:
known/two_sum_hash
known/valid_parentheses_stack
known/graph_reachability_bfs
known/dijkstra_positive_weights
composed/sliding_window_max_deque
composed/xor_sat_f2
composed/two_sat_scc
synthetic/parity_interval_merge
synthetic/rotating_key_reachability
adversarial/wrong_hint_shortest_path
Clone the repository:
git clone https://github.com/chi-binh-ta/adc-bench.git
cd adc-benchInstall dependencies:
pip install -r requirements.txtList all tasks:
python -m adc_bench.cli listRun the weak baseline agent:
python examples/run_baseline.pyRun one task:
python -m adc_bench.cli run-task --task tasks/known/two_sum_hash --submission submissions/two_sum_hashRun all tasks:
python -m adc_bench.cli run-all --submissions submissionsResults are written to:
reports/results.json
reports/summary.md
ADC-bench also includes a standalone verifier module:
python -m adc_bench.verifier <task_dir> <submission_dir> [timeout_seconds]Example:
python -m adc_bench.verifier tasks/known/two_sum_hash examples/sample_submissions/two_sum_hash 5The verifier checks a single task submission and prints a JSON report.
It evaluates:
public tests
hidden tests
stress tests
trace.json quality
anti-cheat scan
partial ADC score
A typical successful output looks like:
{
"task_id": "known/two_sum_hash",
"passed_public": true,
"passed_hidden": true,
"passed_stress": true,
"correctness": 1.0,
"complexity": 1.0,
"trace_quality": 1.0,
"anti_cheat": 1.0,
"adc_score_partial": 0.75,
"errors": []
}You can also run the verifier through the main ADC-bench CLI:
python -m adc_bench.cli verify --task tasks/known/two_sum_hash --submission examples/sample_submissions/two_sum_hash --timeout 5For a human-readable report:
python -m adc_bench.cli verify --task tasks/known/two_sum_hash --submission examples/sample_submissions/two_sum_hash --timeout 5 --format textTo write the verifier report to a file:
python -m adc_bench.cli verify --task tasks/known/two_sum_hash --submission examples/sample_submissions/two_sum_hash --timeout 5 --format text --out reports/two_sum_hash_verifier.txtThe standalone verifier is useful for debugging one task before running the full benchmark.
Note: adc_score_partial does not include full transfer and robustness scoring. Those are better computed at the evaluator or family level.
ADC-bench v0.3.1 includes an offline self-improvement benchmark layer. It evaluates whether a candidate agent improves over a weak baseline agent on held-out algorithm discovery tasks.
Input:
baseline agent directory
candidate agent directory
self-improvement task directory
training task list
held-out task list
failure logs
improvement goal
Output:
baseline_average_score
candidate_average_score
absolute_gain
normalized_gain
no_regression
improvement_trace_quality
anti_cheat
reproducibility
self_improve_score
split_is_disjoint
split_overlap_tasks
train_task_count
heldout_task_count
The scoring formula is:
SelfImproveScore =
0.50 * normalized_gain
+ 0.20 * no_regression
+ 0.15 * improvement_trace_quality
+ 0.10 * anti_cheat
+ 0.05 * reproducibility
Run the sample self-improvement evaluation:
python -m adc_bench.cli self-improve-eval \
--baseline-agent agents/agent_v0 \
--candidate-agent examples/sample_self_improve \
--task self_improvement/tasks/basic_agent_upgrade \
--timeout 5 \
--format textThis layer is offline and reproducible. It does not perform live recursive self-modification; it compares two fixed agent directories after the candidate has already been produced. The anti-cheat checks are lightweight and intended for local research, not as a secure sandbox.
train_tasks.json and heldout_tasks.json must be disjoint. ADC-bench v0.3.1
enforces this by default for self-improvement evaluation. If a split overlaps,
the evaluator returns a structured result with score 0.0, split_is_disjoint
set to false, and split_overlap_tasks listing the overlapping tasks.
Use --allow-overlap only for debugging legacy experiments. Even with that
flag, the result still reports:
split_is_disjoint
split_overlap_tasks
train_task_count
heldout_task_count
Strict split sample command:
python -m adc_bench.cli self-improve-eval \
--baseline-agent agents/agent_v0 \
--candidate-agent examples/sample_self_improve \
--task self_improvement/tasks/basic_agent_upgrade \
--timeout 5 \
--format textEach evaluated task produces a result like:
{
"task_id": "known/two_sum_hash",
"correctness": 1.0,
"complexity": 1.0,
"trace_quality": 0.8,
"transfer": 0.7,
"robustness": 1.0,
"anti_cheat": 1.0,
"adc_score": 0.91,
"passed_public": true,
"passed_hidden": true,
"passed_stress": true,
"errors": []
}To add a new task, create a new folder:
tasks/<split>/<task_id>/
Required files:
problem.md
starter.py
reference_solution.py
public_tests.py
hidden_tests.py
stress_tests.py
expected_trace.json
metadata.json
forbidden_shortcuts.md
Minimum metadata.json:
{
"task_id": "my_new_task",
"family": "my_algorithm_family",
"split": "known",
"target_algorithm": "hash set",
"expected_complexity": "O(n)",
"difficulty": "easy",
"requires_invariant": true,
"requires_counterexample": false,
"anti_cheat": true
}A good ADC-bench task should have:
- a hidden algorithmic structure,
- a naive solution that fails stress tests,
- hidden tests that check edge cases,
- a clear expected invariant,
- a meaningful complexity requirement,
- at least one way for a shallow shortcut to fail.
Create a submission folder:
submissions/<task_id>/
Add:
solution.py
trace.json
Example:
submissions/two_sum_hash/
solution.py
trace.json
Then run:
python -m adc_bench.cli run-task --task tasks/known/two_sum_hash --submission submissions/two_sum_hashThe repository includes a weak baseline agent:
agents/agent_v0/
agent.py
This agent mostly copies starter code and emits a weak trace. It is intentionally limited.
Its purpose is to verify that the evaluator works, weak agents do not score too highly, and hidden/stress tests catch naive solutions.
Run it with:
python examples/run_baseline.pyThis repository includes a smoke test workflow:
.github/workflows/smoke-test.yml
The workflow checks that dependencies install, tasks can be listed, and the baseline agent can run. This is not a full benchmark run; it is a lightweight sanity check for repository health.
ADC-bench is based on the following principle:
A model should not be rewarded only for passing tests.
It should be rewarded for discovering the algorithmic reason why the solution works.
A standard coding benchmark asks:
Can the model produce code that passes unit tests?
ADC-bench asks:
Can the model identify, implement, transfer, and explain the algorithmic structure required by the task?
ADC-bench is not:
a secure sandbox
a full formal verification system
a replacement for SWE-bench
a replacement for HumanEval
a large-scale production benchmark yet
ADC-bench is currently a research prototype for studying algorithm discovery behavior in AI coding agents.
ADC-bench v0.3.1 has several known limitations:
- The local sandbox is not secure against malicious code.
- Trace scoring is heuristic.
- Transfer scoring is currently approximate.
- The task set is small.
- Some tasks may still resemble known competitive-programming patterns.
- The self-improvement layer is offline and does not run live recursive agent updates.
- The anti-cheat system catches obvious violations, not sophisticated attacks.
Planned improvements:
v0.3.0 — Offline self-improving agent layer
v0.3.1 — Strict disjoint self-improvement splits
v0.3.2 — Better improvement traces and richer held-out suites
v0.4 — Stronger synthetic task generation
v0.5 — Human-verified task subset
v1.0 — Stable benchmark release
ADC-bench v0.3.1 includes tasks where an offline candidate agent is evaluated against a weak baseline. The candidate receives:
agent_v0 source code
failure logs
training task list
held-out task list
improvement goal
The agent must produce:
agent.py
improvement_trace.json
The evaluator then checks whether:
Score(candidate_agent) > Score(agent_v0)
on held-out tasks. This moves ADC-bench from algorithm discovery toward reproducible agent improvement while keeping v0.3.1 deterministic and offline.
ADC-bench evaluates five levels of capability:
Level 1 — Code correctness
Level 2 — Complexity awareness
Level 3 — Algorithmic structure discovery
Level 4 — Transfer across problem families
Level 5 — Robustness and future self-improvement
The goal is not to reward code that merely passes visible tests.
The goal is to evaluate whether an AI agent can discover, repair, transfer, certify, and robustly apply algorithmic structure.