Skip to content

chi-binh-ta/adc-bench

Repository files navigation

ADC-bench

ADC-benchAlgorithm Discovery Complexity Benchmark — is a local Python benchmark for evaluating whether AI coding agents can discover, repair, transfer, and certify algorithmic structure, not merely pass visible unit tests.

ADC-bench is built around a simple research question:

Can an AI agent identify the hidden algorithmic structure required to solve a problem under correctness, complexity, transfer, and anti-cheating constraints?

Unlike ordinary coding benchmarks, ADC-bench does not only ask whether code passes tests. It also asks whether the agent understands the algorithmic idea, rejects weaker approaches, explains the invariant, and satisfies the expected complexity.


1. Motivation

Most programming benchmarks evaluate whether a model can produce code that passes unit tests.

ADC-bench focuses on a deeper distinction:

Instance solving       = solve one concrete input
Code patching          = produce code that passes tests
Algorithm discovery    = infer the hidden algorithmic structure
Certification          = explain why the algorithm works
Transfer               = apply the same structure to related variants

ADC-bench is inspired by software engineering benchmarks, but it is not a SWE-bench clone. SWE-style benchmarks usually ask whether an agent can patch real software issues. ADC-bench asks whether an agent can discover the underlying algorithmic structure behind a task.

The central principle is:

Passing tests is not the same as discovering an algorithm.

2. Core Rule of the Game

For each task, an agent receives a problem specification and starter code.

The agent must produce two files:

solution.py
trace.json

The evaluator then checks:

  1. whether the solution passes hidden tests,
  2. whether it satisfies complexity requirements,
  3. whether the trace correctly identifies the algorithm,
  4. whether the solution transfers across related task families,
  5. whether it is robust to adversarial traps,
  6. whether it avoids cheating.

The benchmark therefore evaluates both:

code correctness

and

algorithmic discovery quality

3. Repository Structure

adc-bench/
  README.md
  LICENSE
  requirements.txt
  pyproject.toml

  adc_bench/
    __init__.py
    cli.py
    evaluator.py
    scoring.py
    sandbox.py
    schemas.py
    trace_validation.py
    anti_cheat.py
    report.py

  tasks/
    known/
    composed/
    synthetic/
    adversarial/

  agents/
    agent_v0/
      README.md
      agent.py

  submissions/
    .gitkeep

  reports/
    .gitkeep

  examples/
    run_baseline.py

4. Task Format

Each task lives in:

tasks/<split>/<task_id>/

Each task folder must contain:

problem.md
starter.py
reference_solution.py
public_tests.py
hidden_tests.py
stress_tests.py
expected_trace.json
metadata.json
forbidden_shortcuts.md

Example:

tasks/known/two_sum_hash/
  problem.md
  starter.py
  reference_solution.py
  public_tests.py
  hidden_tests.py
  stress_tests.py
  expected_trace.json
  metadata.json
  forbidden_shortcuts.md

5. Files Visible to the Agent

The agent may use:

problem.md
starter.py
metadata.json
public_tests.py
forbidden_shortcuts.md

These files define the task, the starter implementation, public examples, and forbidden shortcuts.


6. Files Hidden from the Agent

The agent must not read or depend on:

hidden_tests.py
stress_tests.py
reference_solution.py
expected_trace.json

These files are used only by the evaluator.

Reading or hard-coding against these files is considered cheating.


7. Required Submission Format

For a task with ID:

two_sum_hash

the submission should be placed at:

submissions/two_sum_hash/

and must contain:

solution.py
trace.json

8. solution.py

solution.py must define a function named:

def solve(...):
    ...

The function signature must match the task description in problem.md.

Example:

def solve(nums, target):
    seen = set()
    for x in nums:
        if target - x in seen:
            return True
        seen.add(x)
    return False

9. trace.json

trace.json is a structured technical explanation of the discovered algorithm.

It is not private chain-of-thought. It should be a concise, inspectable explanation of the algorithmic decision.

Required schema:

{
  "chosen_algorithm": "...",
  "hypotheses": ["...", "..."],
  "rejected_algorithms": [
    {
      "name": "...",
      "reason": "..."
    }
  ],
  "invariant": "...",
  "complexity_time": "...",
  "complexity_space": "...",
  "edge_cases": ["...", "..."],
  "counterexample_for_wrong_approach": "optional string"
}

Example:

{
  "chosen_algorithm": "hash set lookup",
  "hypotheses": [
    "try all pairs",
    "sort and use two pointers",
    "use a hash set of previously seen values"
  ],
  "rejected_algorithms": [
    {
      "name": "brute force",
      "reason": "O(n^2) time violates the expected O(n) complexity"
    }
  ],
  "invariant": "After processing index i, the set contains exactly the elements before i.",
  "complexity_time": "O(n)",
  "complexity_space": "O(n)",
  "edge_cases": [
    "empty list",
    "duplicate values",
    "negative numbers"
  ],
  "counterexample_for_wrong_approach": ""
}

10. What Counts as Algorithm Discovery?

A high-quality ADC-bench submission should do more than pass tests.

It should identify the intended structure behind the problem.

Examples:

Task Type Expected Discovery
Two Sum Hash set lookup
Valid Parentheses Stack invariant
Graph Reachability BFS or DFS traversal
Positive Weighted Shortest Path Dijkstra relaxation
Sliding Window Maximum Monotonic deque
XOR-SAT Gaussian elimination over GF(2)
2-SAT Implication graph + strongly connected components
Time-dependent Reachability BFS over expanded state (node, time_mod)

A strong submission should explain:

  1. why the chosen algorithm works,
  2. why weaker approaches fail,
  3. what invariant is maintained,
  4. what the time and space complexity are,
  5. which edge cases matter.

11. Scoring

ADC-bench uses the following score:

ADCScore =
    0.35 * correctness
  + 0.20 * complexity
  + 0.15 * trace_quality
  + 0.15 * transfer
  + 0.10 * robustness
  + 0.05 * anti_cheat

Each component is normalized to the range:

[0, 1]

12. Correctness

Correctness measures whether solution.py passes hidden tests.

correctness = hidden_tests_passed / hidden_tests_total

Public tests are useful for debugging, but they are not enough to establish correctness.


13. Complexity

Complexity measures whether the implementation satisfies the expected asymptotic behavior.

This is tested using:

stress_tests.py
timeouts
large inputs
scaling-sensitive cases

Example:

If a task expects O(n) but the submission uses O(n^2), then it may pass public tests but fail stress tests.


14. Trace Quality

Trace quality measures whether trace.json correctly identifies the algorithmic structure.

The validator checks:

chosen_algorithm
hypotheses
rejected_algorithms
invariant
complexity_time
complexity_space
edge_cases
counterexample_for_wrong_approach

The current trace validator is heuristic. It is not a formal proof checker.


15. Transfer

Transfer measures whether the agent performs well across related tasks in the same family.

Example:

shortest_path_family/
  dijkstra_positive_weights
  wrong_hint_shortest_path
  path_reconstruction_variant

An agent receives a higher transfer score if the discovered strategy generalizes across variants rather than only solving one task by accident.


16. Robustness

Robustness measures performance on adversarial or diagonal tasks.

These tasks may contain:

misleading hints
small public tests that allow brute force
hidden complexity traps
ambiguous edge cases
wrong algorithm suggestions

A robust agent should not blindly follow misleading hints.

Example: the problem says "This probably needs BFS", but the graph has positive weights, so the correct method is Dijkstra.


17. Anti-Cheat Rules

ADC-bench includes a lightweight local anti-cheat scanner.

Forbidden behavior includes:

reading hidden_tests.py
reading stress_tests.py
reading reference_solution.py
reading expected_trace.json
modifying evaluator files
hard-coding test outputs
using eval or exec to bypass restrictions
using subprocess to inspect benchmark files
using filesystem introspection to discover hidden files

Suspicious strings may include:

hidden_tests
stress_tests
reference_solution
expected_trace
os.listdir
__file__
inspect
subprocess
eval(
exec(

Important limitation:

The local sandbox is not secure against malicious code.

The current anti-cheat system is designed to catch obvious violations in a research prototype, not to provide strong security isolation.


18. Task Splits

ADC-bench v0.3.1 contains four task splits:

tasks/
  known/
  composed/
  synthetic/
  adversarial/

18.1 Known Tasks

Canonical algorithmic tasks.

Examples:

two_sum_hash
valid_parentheses_stack
graph_reachability_bfs
dijkstra_positive_weights

These tasks test whether an agent can recognize and implement standard algorithmic structures.

18.2 Composed Tasks

Tasks requiring composition of known techniques.

Examples:

sliding_window_max_deque
xor_sat_f2
two_sat_scc

These tasks test whether an agent can combine multiple ideas or choose the correct abstraction.

18.3 Synthetic Tasks

Original or less-standard problems designed to reduce pure memorization.

Examples:

parity_interval_merge
rotating_key_reachability

These tasks test whether an agent can analyze a new problem rather than only recall a known template.

18.4 Adversarial Tasks

Tasks designed to expose shortcut behavior.

Example:

wrong_hint_shortest_path

These tasks may include misleading hints, hidden complexity traps, or cases where a common algorithmic guess fails.


19. Current Initial Task Set

ADC-bench v0.3.1 includes 10 initial algorithm tasks:

known/two_sum_hash
known/valid_parentheses_stack
known/graph_reachability_bfs
known/dijkstra_positive_weights

composed/sliding_window_max_deque
composed/xor_sat_f2
composed/two_sat_scc

synthetic/parity_interval_merge
synthetic/rotating_key_reachability

adversarial/wrong_hint_shortest_path

20. Installation

Clone the repository:

git clone https://github.com/chi-binh-ta/adc-bench.git
cd adc-bench

Install dependencies:

pip install -r requirements.txt

21. Running the Benchmark

List all tasks:

python -m adc_bench.cli list

Run the weak baseline agent:

python examples/run_baseline.py

Run one task:

python -m adc_bench.cli run-task --task tasks/known/two_sum_hash --submission submissions/two_sum_hash

Run all tasks:

python -m adc_bench.cli run-all --submissions submissions

Results are written to:

reports/results.json
reports/summary.md


Standalone Verifier

ADC-bench also includes a standalone verifier module:

python -m adc_bench.verifier <task_dir> <submission_dir> [timeout_seconds]

Example:

python -m adc_bench.verifier tasks/known/two_sum_hash examples/sample_submissions/two_sum_hash 5

The verifier checks a single task submission and prints a JSON report.

It evaluates:

public tests
hidden tests
stress tests
trace.json quality
anti-cheat scan
partial ADC score

A typical successful output looks like:

{
  "task_id": "known/two_sum_hash",
  "passed_public": true,
  "passed_hidden": true,
  "passed_stress": true,
  "correctness": 1.0,
  "complexity": 1.0,
  "trace_quality": 1.0,
  "anti_cheat": 1.0,
  "adc_score_partial": 0.75,
  "errors": []
}

You can also run the verifier through the main ADC-bench CLI:

python -m adc_bench.cli verify --task tasks/known/two_sum_hash --submission examples/sample_submissions/two_sum_hash --timeout 5

For a human-readable report:

python -m adc_bench.cli verify --task tasks/known/two_sum_hash --submission examples/sample_submissions/two_sum_hash --timeout 5 --format text

To write the verifier report to a file:

python -m adc_bench.cli verify --task tasks/known/two_sum_hash --submission examples/sample_submissions/two_sum_hash --timeout 5 --format text --out reports/two_sum_hash_verifier.txt

The standalone verifier is useful for debugging one task before running the full benchmark.

Note: adc_score_partial does not include full transfer and robustness scoring. Those are better computed at the evaluator or family level.

Self-Improving Agent Layer

ADC-bench v0.3.1 includes an offline self-improvement benchmark layer. It evaluates whether a candidate agent improves over a weak baseline agent on held-out algorithm discovery tasks.

Input:

baseline agent directory
candidate agent directory
self-improvement task directory
training task list
held-out task list
failure logs
improvement goal

Output:

baseline_average_score
candidate_average_score
absolute_gain
normalized_gain
no_regression
improvement_trace_quality
anti_cheat
reproducibility
self_improve_score
split_is_disjoint
split_overlap_tasks
train_task_count
heldout_task_count

The scoring formula is:

SelfImproveScore =
    0.50 * normalized_gain
  + 0.20 * no_regression
  + 0.15 * improvement_trace_quality
  + 0.10 * anti_cheat
  + 0.05 * reproducibility

Run the sample self-improvement evaluation:

python -m adc_bench.cli self-improve-eval \
  --baseline-agent agents/agent_v0 \
  --candidate-agent examples/sample_self_improve \
  --task self_improvement/tasks/basic_agent_upgrade \
  --timeout 5 \
  --format text

This layer is offline and reproducible. It does not perform live recursive self-modification; it compares two fixed agent directories after the candidate has already been produced. The anti-cheat checks are lightweight and intended for local research, not as a secure sandbox.

Strict train/heldout split

train_tasks.json and heldout_tasks.json must be disjoint. ADC-bench v0.3.1 enforces this by default for self-improvement evaluation. If a split overlaps, the evaluator returns a structured result with score 0.0, split_is_disjoint set to false, and split_overlap_tasks listing the overlapping tasks.

Use --allow-overlap only for debugging legacy experiments. Even with that flag, the result still reports:

split_is_disjoint
split_overlap_tasks
train_task_count
heldout_task_count

Strict split sample command:

python -m adc_bench.cli self-improve-eval \
  --baseline-agent agents/agent_v0 \
  --candidate-agent examples/sample_self_improve \
  --task self_improvement/tasks/basic_agent_upgrade \
  --timeout 5 \
  --format text

22. Result Format

Each evaluated task produces a result like:

{
  "task_id": "known/two_sum_hash",
  "correctness": 1.0,
  "complexity": 1.0,
  "trace_quality": 0.8,
  "transfer": 0.7,
  "robustness": 1.0,
  "anti_cheat": 1.0,
  "adc_score": 0.91,
  "passed_public": true,
  "passed_hidden": true,
  "passed_stress": true,
  "errors": []
}

23. Adding a New Task

To add a new task, create a new folder:

tasks/<split>/<task_id>/

Required files:

problem.md
starter.py
reference_solution.py
public_tests.py
hidden_tests.py
stress_tests.py
expected_trace.json
metadata.json
forbidden_shortcuts.md

Minimum metadata.json:

{
  "task_id": "my_new_task",
  "family": "my_algorithm_family",
  "split": "known",
  "target_algorithm": "hash set",
  "expected_complexity": "O(n)",
  "difficulty": "easy",
  "requires_invariant": true,
  "requires_counterexample": false,
  "anti_cheat": true
}

A good ADC-bench task should have:

  1. a hidden algorithmic structure,
  2. a naive solution that fails stress tests,
  3. hidden tests that check edge cases,
  4. a clear expected invariant,
  5. a meaningful complexity requirement,
  6. at least one way for a shallow shortcut to fail.

24. Adding a Submission

Create a submission folder:

submissions/<task_id>/

Add:

solution.py
trace.json

Example:

submissions/two_sum_hash/
  solution.py
  trace.json

Then run:

python -m adc_bench.cli run-task --task tasks/known/two_sum_hash --submission submissions/two_sum_hash

25. Baseline Agent

The repository includes a weak baseline agent:

agents/agent_v0/
  agent.py

This agent mostly copies starter code and emits a weak trace. It is intentionally limited.

Its purpose is to verify that the evaluator works, weak agents do not score too highly, and hidden/stress tests catch naive solutions.

Run it with:

python examples/run_baseline.py

26. GitHub Actions Smoke Test

This repository includes a smoke test workflow:

.github/workflows/smoke-test.yml

The workflow checks that dependencies install, tasks can be listed, and the baseline agent can run. This is not a full benchmark run; it is a lightweight sanity check for repository health.


27. Design Philosophy

ADC-bench is based on the following principle:

A model should not be rewarded only for passing tests.
It should be rewarded for discovering the algorithmic reason why the solution works.

A standard coding benchmark asks:

Can the model produce code that passes unit tests?

ADC-bench asks:

Can the model identify, implement, transfer, and explain the algorithmic structure required by the task?

28. What ADC-bench Is Not

ADC-bench is not:

a secure sandbox
a full formal verification system
a replacement for SWE-bench
a replacement for HumanEval
a large-scale production benchmark yet

ADC-bench is currently a research prototype for studying algorithm discovery behavior in AI coding agents.


29. Limitations

ADC-bench v0.3.1 has several known limitations:

  1. The local sandbox is not secure against malicious code.
  2. Trace scoring is heuristic.
  3. Transfer scoring is currently approximate.
  4. The task set is small.
  5. Some tasks may still resemble known competitive-programming patterns.
  6. The self-improvement layer is offline and does not run live recursive agent updates.
  7. The anti-cheat system catches obvious violations, not sophisticated attacks.

30. Roadmap

Planned improvements:

v0.3.0 — Offline self-improving agent layer
v0.3.1 — Strict disjoint self-improvement splits
v0.3.2 — Better improvement traces and richer held-out suites
v0.4   — Stronger synthetic task generation
v0.5   — Human-verified task subset
v1.0   — Stable benchmark release

31. Current Direction: Self-Improving Agent Layer

ADC-bench v0.3.1 includes tasks where an offline candidate agent is evaluated against a weak baseline. The candidate receives:

agent_v0 source code
failure logs
training task list
held-out task list
improvement goal

The agent must produce:

agent.py
improvement_trace.json

The evaluator then checks whether:

Score(candidate_agent) > Score(agent_v0)

on held-out tasks. This moves ADC-bench from algorithm discovery toward reproducible agent improvement while keeping v0.3.1 deterministic and offline.


32. Summary

ADC-bench evaluates five levels of capability:

Level 1 — Code correctness
Level 2 — Complexity awareness
Level 3 — Algorithmic structure discovery
Level 4 — Transfer across problem families
Level 5 — Robustness and future self-improvement

The goal is not to reward code that merely passes visible tests.

The goal is to evaluate whether an AI agent can discover, repair, transfer, certify, and robustly apply algorithmic structure.

About

ADC-bench: A benchmark for algorithm discovery, transfer, and certification in AI coding agents.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages