SpecLeft Delta Demo

A controlled experiment comparing autonomous AI agent workflows — with and without SpecLeft — on the same project, same models, same requirements.

The question: Can AI agents honestly assess the quality of their own output?

The answer: No. But with structured specs, they get significantly closer.

The Experiment

Project: Document Approval Workflow API (FastAPI + SQLAlchemy + SQLite + pytest) Models: Claude Opus 4.6, GPT-5.2-Codex Coding Agent: OpenCode 1.1.36 SpecLeft Version: 0.2.2

Two workflows, run twice each:

Baseline — Agent receives a well-written prompt and the PRD. No specs, no scaffolding.
SpecLeft — Agent externalises behaviour into specs before writing code. Test skeletons generated from scenarios.

Same starting commit, same PRD (5 features, 20 scenarios), same controlled variables.

Results

Metric	Codex Baseline	Codex + SpecLeft	Opus Baseline	Opus + SpecLeft
Total tokens	53,000	146,000	83,243	~147,000
Total tests passed	19	27	53	27
Failed test runs	4	2	2	1
Bugs found during retro	0	0	0	3
Bugs actually fixed	0	0	0	1
Missing Requirements	0	0	1	0

Key Findings

Benefits of SpecLeft

Agents find bugs they'd otherwise ship. Opus with SpecLeft caught 3 real defects during behaviour verification — including a Python truthiness trap (timeout_hours=0 treated as falsy) that the baseline run shipped silently. The baseline agent reported 0 issues because it had no spec to verify against.

TDD emerges naturally. Codex with SpecLeft generated test skeletons via specleft test skeleton, then wrote functional assertions before implementation code — genuine test-driven development without being prompted to do it. The scaffolding structure guided the behaviour.

Fewer failed test runs. Both agents had fewer test failures with SpecLeft. Codex dropped from 4 to 2 failed runs. Opus from 2 to 1. The upfront planning reduced mid-implementation corrections.

Traceability from requirement to test. After the SpecLeft runs, specleft status maps every scenario to its implementation state and priority. After the baseline runs, the only way to check requirement coverage is to read every test manually.

Self-assessment becomes meaningful. Baseline retrospectives were generic ("logic gaps caught early"). SpecLeft retrospectives referenced specific bugs with line numbers, named the root causes, and proposed targeted fixes. The specs gave the agent a framework to evaluate itself against.

Design decisions made explicit. The PRD had open questions (escalation depth cap, reviewer preservation on resubmit, delegation-escalation interaction). SpecLeft forced these to be resolved during spec externalisation. The baseline agents resolved them implicitly during coding — with no record of what was decided or why.

Improvement Areas

These are known trade-offs from v0.2.2 that SpecLeft is actively working to address.

Token overhead during spec externalisation. The spec planning phase consumed 45k–49k tokens — agents read back every generated spec file to verify output. Reducing this through smarter context management (summary responses, on-demand spec loading) is a priority for the MCP server integration in v0.3.0.

Overall token cost is higher. SpecLeft runs used 2–3x more tokens than baseline. The trade-off is quality for cost — fewer hidden defects, but more tokens consumed. Optimising the planning-to-implementation token ratio is an active area of improvement.

Time to completion increases. The externalisation phase adds time. Codex went from ~18m to ~38m. Opus from ~14m to ~21m. As the MCP server enables on-demand spec access (rather than upfront loading), this overhead should decrease.

Agent reads all specs regardless of relevance. Both agents loaded the full spec surface into context even when implementing a single scenario. A targeted approach — agent requests only the next scenario via MCP — would significantly reduce per-turn context cost. Estimated improvement: 80% reduction in spec-related overhead.

Experiment Log

This is a living document. As SpecLeft evolves, new experiment runs will be added here.

Version	Date	Experiment	Key Change	Result
v0.2.2	Feb 2026	CLI workflow vs baseline	Initial comparison	Quality improved, tokens 2-3x higher
v0.3.0	Planned	MCP workflow vs CLI	On-demand spec loading via MCP server	Target: 80% reduction in spec overhead

Repo Structure

specleft-delta-demo/
├── without-specleft/       # Baseline agent output
├── with-specleft/          # SpecLeft-assisted agent output
├── prd.md                  # Shared product requirements
├── SKILLS.md               # Agent instructions
└── README.md

Run It Yourself

pip install specleft

# Baseline: give the agent prd.md and let it build
# SpecLeft: initialise specs first, then build against them
specleft init
specleft plan --from prd.md
specleft status

Full write-up:

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
with-specleft		with-specleft
without-specleft		without-specleft
.gitattributes		.gitattributes
.gitignore		.gitignore
.python-version		.python-version
AGENT_RESULTS.md		AGENT_RESULTS.md
LICENSE		LICENSE
PRD.md		PRD.md
PROMPTS.md		PROMPTS.md
README.md		README.md
SKILLS.md		SKILLS.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SpecLeft Delta Demo

The Experiment

Results

Key Findings

Benefits of SpecLeft

Improvement Areas

Experiment Log

Repo Structure

Run It Yourself

Links

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SpecLeft Delta Demo

The Experiment

Results

Key Findings

Benefits of SpecLeft

Improvement Areas

Experiment Log

Repo Structure

Run It Yourself

Links

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages