Project Ditto v2

A replication-and-generality test for the Ditto constraint-chain abstraction, applied to programming-task agent trajectories from Terminal-Bench 2.0, SWE-bench Verified, and human debugging sessions.

Builds on Project Ditto v1 (Pokémon telemetry). The experiment tests whether the same real-vs-shuffled detectability effect reproduces on programming telemetry using the same six constraint types.

Pre-registration: SPEC.md + SPEC.pdf — thresholds and methodology are frozen before evaluation.

What It Does

Agent trajectories (shell commands, file edits, test runs) are translated into abstract constraint chains via T-code — six constraint types that capture resource budgets, tool availability, sub-goal transitions, information state, coordination dependencies, and optimization criteria. Chains are rendered in domain-blind abstract English (no programming vocabulary), then real chains are compared against shuffled controls. Claude Haiku and Sonnet are asked to predict the next constraint at a mid-chain cutoff; higher accuracy on real vs shuffled chains is the effect of interest.

Status

Source	Trajectories	Real Chains	Gate 3	Gate 4
Terminal-Bench 2.0 + extras	4,258	170	PASS	PASS (100%)
SWE-bench Verified	5,000	1,016	PASS	PASS (100%)
Human sessions (SpecStory)	in progress	pending	pending	pending

All gates except human: green. Gate 5 (live API dry-run): PASS.

Pipeline

data/<source>/trajectories.jsonl
  → scripts/acquire_<source>.py       # HuggingFace / GitHub acquisition
  → src/parser_<source>.py            # → TrajectoryLog
  → src/aggregation.py                # event compression
  → src/translation.py                # T-code: events → Constraints  [FROZEN]
  → src/observability.py              # asymmetric information reveal
  → src/filter.py                     # is_valid_chain()
  → src/renderer.py                   # abstract English (leakage check)
  → src/shuffler.py                   # seeded shuffled control variants
chains/real/<source>/*.jsonl
chains/shuffled/<source>/*.jsonl
  → src/reference.py                  # StateSignature → action distribution
data/reference_<source>.pkl
  → src/runner.py / scripts/run_evaluation.py   # Batches API evaluation
results/raw/ + results/blinded/
  → src/scorer.py                     # Layer 1/2/3 scoring (Session 13)
results/scored.json

T-code is frozen at tag T-code-v1.0-frozen. src/translation.py, src/aggregation.py, and src/renderer.py must not change after this tag.

Evaluation Models

Model	ID
Claude Haiku 4.5	`claude-haiku-4-5-20251001`
Claude Sonnet 4.6	`claude-sonnet-4-6`

Seeds: 42 (T=0.0), 1337 and 7919 (T=0.5). Evaluation via Anthropic Messages Batches API (50% cost reduction).

Quick Start

pip install -r requirements.txt

# Build chains (after data acquisition)
python scripts/build_chains.py --source swe --data data/swe_bench_verified/ \
  --out-real chains/real/swe/ --out-shuffled chains/shuffled/swe/ --gate3

# Build reference distributions
python -m src.reference build-raw --source swe --raw chains/real/swe/ --out data/reference_swe.pkl
python -m src.reference check --dist data/reference_swe.pkl --chains chains/real/swe/ --target 0.9

# Dry-run evaluation (5 chains)
python -m src.runner --model haiku --source swe --chains chains/real/swe/ --seed 42 --dry-run --n 5

# Full evaluation (Session 12)
python scripts/run_evaluation.py

# Run tests
pytest tests/

Key Files

File	Purpose
`SPEC.md`	Pre-registered methodology and thresholds (frozen)
`SPEC.pdf`	Immutable PDF of pre-registration
`CLAUDE.md`	Architecture, session protocol, invariants
`SESSION_LOG.md`	Per-session progress log
`BUILD_PLAN.md`	Session-by-session plan
`src/translation.py`	T-code (frozen at `T-code-v1.0-frozen`)
`scripts/run_evaluation.py`	Full evaluation entry point (Session 12)

Constraint Types (identical to v1)

ResourceBudget · ToolAvailability · SubGoalTransition · InformationState · CoordinationDependency · OptimizationCriterion

Pre-registration v1.1 — 2026-04-22. See Project Ditto v1 for the original Pokémon telemetry experiment.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
chains		chains
data		data
results		results
scripts		scripts
src		src
supplementary		supplementary
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
CORRECTED_SCORING.md		CORRECTED_SCORING.md
CORRECTED_SCORING_v2.md		CORRECTED_SCORING_v2.md
LICENSE		LICENSE
PROPOSED_CHANGES.md		PROPOSED_CHANGES.md
README.md		README.md
RESULTS.md		RESULTS.md
SESSION_LOG.md		SESSION_LOG.md
SPEC.md		SPEC.md
SPEC.pdf		SPEC.pdf
SUPPLEMENTARY.md		SUPPLEMENTARY.md
VERIFICATION_REPORT.md		VERIFICATION_REPORT.md
WRITEUP.md		WRITEUP.md
ditto_v2_build_plan.pdf		ditto_v2_build_plan.pdf
ditto_v2_programming_spec_v11.pdf		ditto_v2_programming_spec_v11.pdf
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project Ditto v2

What It Does

Status

Pipeline

Evaluation Models

Quick Start

Key Files

Constraint Types (identical to v1)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Project Ditto v2

What It Does

Status

Pipeline

Evaluation Models

Quick Start

Key Files

Constraint Types (identical to v1)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages