exe-build-skills is an open-source Spec-Driven Development (SDD) + Test-Driven Development (TDD) pipeline that turns a feature idea or design document into production-ready, verified code with full requirement traceability. It runs as plain Markdown skill files inside Claude Code, OpenCode, or any AI coding agent — no dependencies, no API keys, no compiled code. Every instruction is auditable text.
The pipeline generates EARS requirements, acceptance criteria, executable tests, implementation, and automated verification with 10+ validation gates and 8 domain-specialist fix agents. Designed for Claude Opus 4.6's 1M context window. (Updated March 18, 2026)
Recommended workflow: Before running the pipeline, use Perplexity or the latest frontier models to research what you want to build. Produce a design document or PRD in Markdown, then feed it into exe-build-e2e to activate the full pipeline. You can install the Perplexity MCP server and Context7 MCP server into Claude Code to research and generate the design doc without leaving the terminal.
"Add user authentication with OAuth"
│
▼
┌─ spec-writer ──────────────────────────────────────┐
│ requirements.md → design.md → tasks.md │
│ (EARS methodology, 3 validation gates) │
└────────────────────────────────────────────────────┘
│
▼
┌─ ac-writer ────────────────────────────────────────┐
│ scenarios.md → ac.json │
│ (Gherkin scenarios, acceptance criteria) │
└────────────────────────────────────────────────────┘
│
▼
┌─ test-writer ──────────────────────────────────────┐
│ evaluation.json → tests/*.test.ts │
│ (executable tests in your project's framework) │
└────────────────────────────────────────────────────┘
│
▼
┌─ Implementation ───────────────────────────────────┐
│ TDD: write code to make the generated tests pass │
└────────────────────────────────────────────────────┘
│
▼
┌─ Verification ─────────────────────────────────────┐
│ test-runner → code-evaluator → fix-dispatcher │
│ (8 specialist sub-agents, up to 3 fix iterations) │
└────────────────────────────────────────────────────┘
- Quick Start
- Installation
- Usage
- Why This Exists
- How It Works
- Design Decisions and Trade-offs
- How exe-build Compares
- Token Usage and Model Considerations
- Security and Auditability
- Project Structure
- Roadmap
- Contributing
- FAQ
# 1. Clone exe-build-skills
git clone https://github.com/AskExe/exe-build-skills.git
# 2. Copy skills + sub-agents into your project
cp -r exe-build-skills/.claude/skills/ your-project/.claude/skills/
cp -r exe-build-skills/.claude/subagents/ your-project/.claude/subagents/
# 3. Open Claude Code in your project and run:
# /exe-build-e2e
That's it. No npm install, no API keys, no config. The pipeline auto-detects your language, framework, and test runner. Feed it a feature idea or a design document and it generates EARS requirements, acceptance criteria, tests, implementation, and verification, and runs fully autonomously after one confirmation checkpoint.
Copy the .claude/skills/ directory into your project:
# Clone the repo
git clone https://github.com/AskExe/exe-build-skills.git
# Copy skills into your project
cp -r exe-build-skills/.claude/skills/ your-project/.claude/skills/
# Copy sub-agents (needed for fix-dispatcher)
cp -r exe-build-skills/.claude/subagents/ your-project/.claude/subagents/
Claude Code automatically discovers skills from .claude/skills/*/SKILL.md.
Copy the .opencode/skills/ directory into your project:
cp -r exe-build-skills/.opencode/skills/ your-project/.opencode/skills/
OpenCode discovers skills from .opencode/skills/*/SKILL.md with the same format.
The skills are standard Markdown files. Copy .claude/skills/ into whatever directory your agent reads for instructions.
your-project/
├── .claude/
│ ├── skills/ # 9 pipeline skills
│ │ ├── spec-writer/SKILL.md
│ │ ├── ac-writer/SKILL.md
│ │ ├── test-writer/SKILL.md
│ │ ├── test-runner/SKILL.md
│ │ ├── code-evaluator/SKILL.md
│ │ ├── code-reviewer/SKILL.md
│ │ ├── fix-dispatcher/SKILL.md
│ │ ├── ac-fix/SKILL.md
│ │ └── exe-build-e2e/SKILL.md
│ └── subagents/ # 8 specialist fix sub-agents
│ ├── validation-fix.md
│ ├── auth-fix.md
│ ├── state-fix.md
│ ├── error-handling-fix.md
│ ├── concurrency-fix.md
│ ├── data-fix.md
│ ├── integration-fix.md
│ └── security-fix.md
No dependencies. No npm install. No API keys. Just Markdown files.
Tell Claude Code to run the full pipeline:
> Run the exe-build-e2e pipeline for: user authentication with OAuth
Or use the slash command:
> /exe-build-e2e
The pipeline will:
- Ask for the feature name and classify complexity (Tier 1/2/3)
- Present the classification for your review — you can override ("Make it Tier 2")
- After you confirm, run fully autonomously: spec → AC → tests → implement → verify → fix
If you have a PRD, design doc, or research document:
> Run exe-build-e2e for this design doc: @docs/auth-redesign.md
The pipeline enters Parse Mode — it extracts requirements from your document instead of asking questions, fills gaps with reasonable assumptions (noted in requirements.md), and runs autonomously from there.
Run any skill standalone:
> /spec-writer # Generate requirements, design, tasks from an idea
> /ac-writer # Generate acceptance criteria from existing specs
> /test-writer # Generate tests from acceptance criteria
> /test-runner # Execute tests and collect results
> /code-evaluator # Analyze test failures and categorize defects
> /fix-dispatcher # Dispatch specialists to fix categorized defects
> /code-reviewer # LLM-native code review against specs
> Continue the pipeline from Phase 2 for user-auth # Resume from a specific phase
> Run Phase 4 only for user-auth # Just verify an implementation
> What's the pipeline status for user-auth? # Check which artifacts exist
> Run exe-build-e2e for adding a logout button, make it Tier 1
Tier 1 skips design.md, ac-writer, and code-evaluator — fast track for simple features.
All artifacts are written to docs/features/<feature-name>/:
docs/features/user-auth-oauth/
├── requirements.md # EARS atomic requirements (REQ-001, REQ-002, ...)
├── design.md # Architecture, state diagrams, API contracts
├── tasks.md # Implementation plan with test hooks
├── scenarios.md # Gherkin test scenarios
├── ac.json # Machine-readable acceptance criteria
├── evaluation.json # AC coverage evaluation
├── test-results.json # Test execution results
├── code-evaluation.json # Defect analysis
├── code-report.md # Human-readable failure report
├── fix-report.json # Summary of applied fixes
└── tests/
└── *.test.ts # Executable tests (language auto-detected)
AI coding agents write code that passes tests but misses requirements. exe-build-skills forces every artifact — requirements, acceptance criteria, tests, implementation — to trace back to the original spec, with validation gates that catch drift before it compounds.
The pipeline was built to solve four problems we hit repeatedly:
| Problem | What happens without exe-build | How exe-build fixes it |
|---|---|---|
| Tests validate the wrong thing | AI writes tests for what it thinks the feature does, not what was asked | Tests are generated from validated acceptance criteria, not from the implementation |
| Requirements are vague | AI fills gaps with unchecked assumptions | EARS methodology forces every requirement into a precise, testable statement |
| Fixes are random | AI makes changes until tests pass instead of diagnosing root causes | 8 domain-specialist agents fix defects by class (auth, concurrency, security, etc.) |
| Coverage is invisible | No way to verify all requirements have tests | Numbered REQ → AC → TEST → TASK chain with mechanical coverage checks (E1–E5 gates) |
Every design choice in exe-build has a reason and a cost. This section documents both so you can decide if the trade-offs make sense for your workflow.
What we chose: All requirements must follow EARS (Easy Approach to Requirements Syntax) patterns — Ubiquitous, Event-driven, State-driven, Optional, or Complex.
Why: Free-form requirements like "handle authentication" are untestable. EARS forces every requirement into a precise, testable statement: "When the user submits the login form, the system shall validate credentials against the auth service." This structure is what makes downstream automation possible — ac-writer can enumerate scenarios mechanically, test-writer can generate assertions directly from the pattern.
Trade-off: EARS adds friction to the spec phase. Simple features that "obviously" need two lines of description now get a full requirements.md with metadata fields. For Tier 1 features (< 3 requirements), this overhead is noticeable. We mitigate this with the tiered pipeline — Tier 1 skips design.md and inline AC, cutting the spec phase roughly in half.
Alternative considered: Just generate tests from a natural-language description. This is faster but produces tests that validate the AI's interpretation of the description, not the user's intent. We chose correctness over speed.
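To illustrate why EARS-shaped sentences lend themselves to mechanical processing, here is a minimal sketch that classifies a requirement by the keyword opening each clause. The keyword mapping follows the common EARS templates (When, While, Where), and the name classify_ears is hypothetical. None of this is taken from the spec-writer skill itself, which is Markdown instructions rather than code.

```python
import re

# Leading keywords from the common EARS templates, mapped to the pattern
# names used above. This mapping is an illustrative assumption, not the
# spec-writer skill's actual rules.
KEYWORD_PATTERNS = {
    "when": "Event-driven",
    "while": "State-driven",
    "where": "Optional",
}


def classify_ears(requirement: str) -> str:
    """Return the EARS pattern name for a requirement, or 'Not EARS'."""
    text = requirement.strip()
    # Every EARS requirement states what the system "shall" do.
    if not re.search(r"\bshall\b", text, flags=re.IGNORECASE):
        return "Not EARS"
    # Collect the pattern keywords that open a comma-separated clause.
    found = []
    for clause in re.split(r",\s*", text):
        words = clause.split()
        first_word = words[0].lower() if words else ""
        if first_word in KEYWORD_PATTERNS:
            found.append(KEYWORD_PATTERNS[first_word])
    if not found:
        return "Ubiquitous"
    if len(found) > 1:
        return "Complex"
    return found[0]
```

Under these assumptions, the login example above classifies as Event-driven, while the free-form "handle authentication" is rejected because it never says what the system shall do.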
What we chose: 10+ explicit validation gates across the pipeline. Each gate has specific pass/fail criteria. Failed gates trigger targeted repair (up to 3 attempts) before the pipeline hard-fails.
Why: Without gates, errors compound. A missing error path in requirements becomes a missing scenario in AC, becomes a missing test, becomes an unhandled edge case in production. Each downstream artifact amplifies the original gap. Gates catch problems at the source.
Trade-off: Gates add token cost. Each gate evaluation requires the AI to re-read artifacts and check criteria. Repair loops multiply this — a single failed gate can cost 3x the tokens of the original generation. On average, the full pipeline uses 2–4x more tokens than ungated generation.
Why we accept this cost: Fixing a bad spec after implementation is 10–50x more expensive than fixing it at the gate. The 2–4x overhead is a bargain compared to rework.
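The gate-and-repair behavior described above can be sketched as a small control loop. This is only an illustration of the "check, repair up to 3 times, then hard-fail" flow. The names run_gate, GateFailure, check_shall, and repair_shall are hypothetical, and in the real pipeline the check and repair steps are carried out by the AI agent following the skill instructions, not by Python functions.

```python
from typing import Callable, List

MAX_REPAIR_ATTEMPTS = 3  # "up to 3 attempts" per the text above


class GateFailure(Exception):
    """Raised when a gate still fails after all repair attempts."""


def run_gate(
    artifact: str,
    check: Callable[[str], List[str]],
    repair: Callable[[str, List[str]], str],
) -> str:
    """Validate an artifact, repairing it up to MAX_REPAIR_ATTEMPTS times.

    `check` returns a list of problems (empty means the gate passes).
    `repair` returns a revised artifact given the current problems.
    """
    problems = check(artifact)
    attempts = 0
    while problems and attempts < MAX_REPAIR_ATTEMPTS:
        artifact = repair(artifact, problems)
        attempts += 1
        problems = check(artifact)
    if problems:
        raise GateFailure(f"gate failed after {attempts} repairs: {problems}")
    return artifact


# Hypothetical gate: every requirement line must contain "shall".
def check_shall(doc: str) -> List[str]:
    return [line for line in doc.splitlines() if "shall" not in line]


def repair_shall(doc: str, problems: List[str]) -> str:
    # Stand-in for a targeted repair: rewrite only the failing lines.
    return "\n".join(
        f"The system shall {line}" if line in problems else line
        for line in doc.splitlines()
    )
```

Each pass through the loop re-runs the check, which is where the extra token cost of a failed gate comes from.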
What we chose: 8 domain-expert sub-agents (validation, auth, state, error-handling, concurrency, data, integration, security), each with a 300–450 line playbook.
Why: A generic "fix the failing test" prompt produces shallow fixes — adding a try/catch to suppress an error, hardcoding an expected value, or "fixing" the test to match wrong behavior. Domain specialists know the patterns for their area. The auth-fix specialist knows to check token expiration, session invalidation, and CSRF protection — not just make the auth test pass.
Trade-off: 8 specialist playbooks add ~3,000 lines of Markdown to the repo. Each specialist dispatch is a separate agent invocation with its own context window cost. For simple defects, a generalist would fix it faster and cheaper.
Why we accept this cost: Simple defects are caught by the implementation phase itself. By the time fix-dispatcher runs, the remaining defects are the hard ones — the ones where generic prompting produces band-aid fixes that break something else.
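As a rough sketch of dispatch by defect class, the snippet below groups defects by the specialist playbook that would handle them. The eight class names and the .claude/subagents/ file layout come from this README. The defect dictionary keys ("id", "class") and both function names are assumptions made for illustration, since the code-evaluation.json schema is not described here.

```python
from collections import defaultdict
from typing import Dict, List

# The eight defect classes named above; each maps to a playbook file
# under .claude/subagents/ as shown in the project layout.
SPECIALISTS = [
    "validation", "auth", "state", "error-handling",
    "concurrency", "data", "integration", "security",
]


def playbook_path(defect_class: str) -> str:
    """Return the playbook file for a defect class, or raise if unknown."""
    if defect_class not in SPECIALISTS:
        raise ValueError(f"no specialist for defect class: {defect_class}")
    return f".claude/subagents/{defect_class}-fix.md"


def group_by_specialist(defects: List[dict]) -> Dict[str, List[str]]:
    """Group defect IDs by playbook so each specialist is dispatched once.

    Each defect is a dict with hypothetical keys "id" and "class".
    """
    dispatch: Dict[str, List[str]] = defaultdict(list)
    for defect in defects:
        dispatch[playbook_path(defect["class"])].append(defect["id"])
    return dict(dispatch)
```

Grouping first means two auth defects share one auth-fix invocation instead of paying the context cost twice.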
What we chose: Every artifact has numbered IDs. Bidirectional links are maintained across all artifacts. Changes to one artifact propagate traceability updates to others.
Why: When a test fails, you need to know which requirement it validates, which acceptance criterion defined the expected behavior, and which task is responsible for the implementation. Without traceability, the AI guesses — and guesses wrong. With traceability, fix-dispatcher knows exactly which specialist to send and exactly which code to change.
Trade-off: ID management is a significant portion of the skill instructions. The backfill phase (Phase 3 of test-writer) exists solely to update tasks.md with TEST-### references. This is pure bookkeeping overhead.
Why we accept this cost: Traceability is what makes the verification loop work. Without it, code-evaluator can't map failures to root causes, and fix-dispatcher can't route defects to the right specialist. The bookkeeping enables everything downstream.
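To make the numbered chain concrete, here is a minimal sketch of a coverage check over REQ → AC → TEST links. The REQ-### and TEST-### ID styles appear in this README. The AC-### style, the dictionary shapes, and the function name find_coverage_gaps are assumptions for illustration and do not reflect the actual ac.json or evaluation.json formats.

```python
from typing import Dict, List, Set


def find_coverage_gaps(
    requirement_ids: List[str],
    ac_to_req: Dict[str, str],
    test_to_ac: Dict[str, str],
) -> Dict[str, List[str]]:
    """Report requirements with no AC and ACs with no test.

    `ac_to_req` maps each acceptance criterion ID to its requirement ID.
    `test_to_ac` maps each test ID to the acceptance criterion it covers.
    """
    covered_reqs: Set[str] = set(ac_to_req.values())
    covered_acs: Set[str] = set(test_to_ac.values())
    return {
        "requirements_without_ac": [
            r for r in requirement_ids if r not in covered_reqs
        ],
        "acs_without_test": [a for a in ac_to_req if a not in covered_acs],
    }
```

Because every link is an explicit ID, a gap is a plain lookup failure rather than something the AI has to infer.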
What we chose: The pipeline has exactly one interactive checkpoint at the start (confirm feature scope, review tier classification). After that, it runs fully autonomously until completion or hard failure.
Why: AI agents lose coherence when they stop and start. Each pause breaks the execution flow, requires re-reading context, and introduces opportunities for scope drift. A single checkpoint at the start validates the user's intent, then the pipeline executes without interruption.
Trade-off: If the pipeline makes a wrong assumption during autonomous execution, it will build on that assumption until a gate catches it (or until it finishes). The user can't course-correct mid-run.
Why we accept this cost: Gates are the course-correction mechanism. A wrong assumption in requirements.md will be caught when ac-writer can't generate valid scenarios for it, or when test-writer's evaluation criteria fail. The gates are the checkpoints — they're just automated.
What we chose: The pipeline generates executable tests from acceptance criteria before any implementation code is written. Implementation means "make the generated tests pass."
Why: When tests are written after code, they test what the code does — not what it should do. AI agents are particularly prone to this: they write code, then write tests that validate exactly that code, including its bugs. By generating tests from AC before implementation, the tests are independent of the implementation and validate the spec, not the code.
Trade-off: Generated tests sometimes require test infrastructure that doesn't exist yet (mocks, fixtures, test databases). The implementation phase has to set up this infrastructure as part of making tests pass, which can be awkward.
What we chose: The pipeline runs inline (same conversation context) by default. Only fix-dispatcher uses separate agents, and only because specialists benefit from isolated context.
Why: Every agent dispatch loses context. The parent must summarize what happened, the child must re-read artifacts. Running inline preserves the full context of what was generated and why. Fix specialists are the exception because they need deep domain context from their playbooks, and their changes are isolated enough to work without full pipeline context.
Trade-off: Long pipelines can approach context window limits. For Tier 3 features, inline execution can consume 300K–500K tokens in a single conversation. The skill includes a pressure valve: if context exceeds ~800K tokens, remaining phases are split to agents.
What we chose: spec-writer auto-detects your project's language, framework, test framework, build system, and directory structure, then embeds this as a YAML frontmatter block in requirements.md.
Why: Downstream skills need to know how to generate tests (Jest vs pytest vs Go testing), where to put files, and what assertion patterns to use. Rather than having each skill re-scan the codebase, the profile is detected once and consumed by all downstream phases.
Trade-off: The profile can be wrong. Auto-detection from file patterns is heuristic. A monorepo with both Python and TypeScript might get the wrong primary language. The profile supports user overrides (# user-override comments) but the user has to notice the error first.
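The misdetection risk is easier to see in a small sketch of marker-file detection. This is an assumed heuristic written for illustration, not the logic in spec-writer. The marker list, its ordering, and the detect_language function are all hypothetical.

```python
from pathlib import Path
from typing import Optional

# Marker files checked in order; the first match wins. This ordering is
# why a mixed Python/TypeScript monorepo can be misdetected.
MARKERS = [
    ("tsconfig.json", "typescript"),
    ("package.json", "javascript"),
    ("pyproject.toml", "python"),
    ("go.mod", "go"),
]


def detect_language(project_root: Path, override: Optional[str] = None) -> str:
    """Guess the primary language; an explicit override always wins."""
    if override:
        return override
    for filename, language in MARKERS:
        if (project_root / filename).exists():
            return language
    return "unknown"
```

In a repository containing both pyproject.toml and tsconfig.json, this sketch reports typescript unless the caller passes an override, which mirrors the situation where the user has to notice the error and correct the profile.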
Superpowers is a workflow plugin for Claude Code that provides brainstorming, plan writing, and subagent-driven execution with two-stage review (spec compliance + code quality).
Where Superpowers excels:
- Lower friction for small tasks — brainstorming is conversational, plans are quick
- Subagent-per-task model keeps context fresh and prevents pollution
- Two-stage review (spec compliance, then code quality) is clean and simple
- Plugin marketplace installation (/plugin install) is simpler than copying files (exe-build plugin coming soon)
- Works well for plan-execute workflows where the plan is written by a human or the AI
Where exe-build differs:
| Dimension | Superpowers | exe-build |
|---|---|---|
| Spec format | Natural-language design doc | EARS atomic requirements with metadata (from your design doc/PRD) |
| Design phase | Built-in brainstorming skill (Socratic Q&A → design doc) | Outsourced — use Perplexity, Context7, or any research tool to create a design doc, then feed it into the pipeline. If the doc has gaps, spec-writer notes assumptions and the Initial Review checkpoint lets you correct scope before autonomous execution. Standalone /spec-writer will also ask targeted questions to fill gaps. |
| Traceability | Task-level: plan task → implementation → spec review → quality review | Requirement-level: REQ → AC → TEST → TASK with numbered IDs and bidirectional links across all artifacts |
| Validation | Two-stage review per task (spec compliance check, then code quality check with Critical/Important/Minor severity) | 10+ gates across the full pipeline (E1–E5 for AC coverage, C1–C5 for implementation, Gate 1–3 for specs) with auto-repair loops on failure |
| Test generation | Strict TDD per task — same agent writes failing test first, then implements code to pass it (RED-GREEN-REFACTOR) | Tests derived from acceptance criteria before any implementation begins — a separate skill generates all tests from AC, then implementation makes them pass. Tests are independent of the implementer's interpretation. |
| Fix strategy | Implementer fixes own code based on reviewer feedback, reviewer re-checks | 8 domain specialists (auth, security, state, concurrency, etc.) dispatched by defect class — each with a 300–450 line playbook of domain-specific patterns |
| Failure analysis | Code reviewer categorizes issues by severity (Critical/Important/Minor) with file:line references and fix suggestions | code-evaluator classifies failures by defect type, maps each to root cause and originating requirement, prioritizes by impact, then routes to the right specialist |
| Execution model | Subagent-per-task — each task gets a fresh agent with clean context, no pollution from previous tasks | Inline by default — entire pipeline runs in one conversation, preserving full context of why each requirement exists and how each AC was derived across all phases |
| Scaling | Same process for all tasks | Tier 1/2/3 adapts pipeline depth to feature complexity — Tier 1 skips design/AC for simple features, Tier 3 adds architecture review and property-based tests |
| Token cost | Lower per-task (one implementer + two reviewers per task) | Higher total (full pipeline with gates and repair loops), but front-loads cost into spec validation to prevent expensive rework downstream |
The execution model trade-off is fundamental. Superpowers spawns a fresh subagent for every task. Each agent starts clean — no context pollution from previous tasks, no risk of the model "forgetting" instructions mid-conversation. The cost is that each subagent loses the context of what other tasks did. The parent orchestrator summarizes, but summaries lose nuance. exe-build takes the opposite bet: run the entire pipeline inline so that when test-writer generates tests, it has the full context of why spec-writer wrote each requirement and how ac-writer derived each acceptance criterion. No summarization loss. The cost is that long pipelines accumulate context and can approach window limits — exe-build mitigates this by splitting to agents only when context exceeds ~800K tokens or for fix-specialist dispatch where isolated domain expertise matters more than pipeline context.
The test generation trade-off matters. Both systems enforce TDD — tests before code. The difference is where the tests come from. In Superpowers, the same agent that writes the test also writes the implementation. It's disciplined (strict RED-GREEN-REFACTOR), but the tests reflect that single agent's interpretation of the task. In exe-build, tests are generated by a separate skill (test-writer) from validated acceptance criteria that were derived from EARS requirements. The tests are structurally independent of the implementation — they validate the spec, not the implementer's understanding of the spec. This means exe-build catches cases where the implementer would have misunderstood the requirement, because the test was written by a different phase with different inputs.
The design phase trade-off. Superpowers handles both design and execution in one system — brainstorming explores your intent through Socratic questions, produces a design doc, then the planner and subagents build it. exe-build intentionally separates design from execution. We believe you should have an extensive, in-depth conversation about what you want to build — using Perplexity, Context7, Claude, or any research tool — before you start building. Produce a thorough design document or PRD, then feed it into exe-build's pipeline. This isn't a limitation — it's a design choice. The research and brainstorming phase benefits from tools optimized for exploration (web search, deep research, multi-turn conversation). The build phase benefits from tools optimized for structured execution (gates, traceability, auto-repair). Combining both into one system means neither is best-in-class at its job. That said, exe-build doesn't leave you stranded with a bad doc. When you run /exe-build-e2e with an incomplete or vague document, the pipeline will extract what it can, note assumptions for any gaps it finds in requirements.md, and the Initial Review checkpoint gives you a chance to correct scope before autonomous execution begins. You don't need to run individual skills — /exe-build-e2e handles the full flow from document to verified code, including gap handling.
Neither system is universally better. The right choice depends on what you're building.
Superpowers is faster when:
- Tasks are well-defined and independent (subagent-per-task parallelizes naturally)
- You're iterating rapidly on a feature you understand well
- The plan is solid and you need execution speed, not spec validation
- You want quick brainstorm → plan → build cycles for small-to-medium features
- You need git worktree isolation, branch management, and merge/PR workflows built in
exe-build produces higher coverage when:
- Requirements are complex, ambiguous, or derived from a document (EARS forces precision)
- Missing an edge case would be costly (gates mechanically check every requirement has tests)
- You need to prove traceability for compliance or audit (numbered REQ → AC → TEST → TASK chain)
- The feature has many error paths, state transitions, or integration points (ac-writer enumerates them systematically)
- Defects are domain-specific (auth bugs need auth expertise, not generic "fix it")
The honest assessment: For a well-understood 30-minute task, Superpowers' subagent model will ship it faster with less token spend. For a complex feature with 8+ requirements where you need confidence that every error path is tested and every requirement is traceable to implementation, exe-build's pipeline catches gaps that per-task reviews miss — because per-task reviews can't see cross-task coverage holes.
These systems have complementary strengths. Superpowers has skills exe-build doesn't — and exe-build has pipeline depth Superpowers doesn't. Here's how to use both:
Workflow 1: Superpowers brainstorming → exe-build pipeline
Use Superpowers' brainstorming skill to explore your intent and produce a design doc through Socratic Q&A. Then feed that design doc into exe-build's exe-build-e2e pipeline, which transforms it into EARS requirements, validated acceptance criteria, executable tests, and verified code with full traceability.
Workflow 2: exe-build specs → Superpowers execution
Run exe-build's spec phase (spec-writer → ac-writer → test-writer) to generate validated requirements, acceptance criteria, and tests. Then use Superpowers' subagent-driven-development to execute the tasks.md — each task gets a fresh agent with two-stage review, and the tests are already written from AC so the implementer just needs to make them pass.
Workflow 3: Use Superpowers skills that exe-build lacks
Superpowers includes skills for areas exe-build doesn't cover:
| Superpowers skill | What it does | When to use with exe-build |
|---|---|---|
| using-git-worktrees | Creates isolated branch workspace | Before starting an exe-build pipeline run, to isolate the work |
| systematic-debugging | 4-phase root cause investigation | When fix-dispatcher exhausts 3 iterations and you need manual debugging |
| verification-before-completion | "Evidence before claims": run proof commands before declaring done | After exe-build pipeline completes, as a final smoke test before merge |
| finishing-a-development-branch | Verify tests → present merge/PR options → clean up | After exe-build pipeline completes, to merge the work back to main |
| receiving-code-review | Respond to review feedback with technical rigor, not blind agreement | When a human reviewer has comments on code exe-build generated |
| brainstorming | Socratic design exploration | When you don't have a design doc yet and don't want to use Perplexity |
A realistic combined session:
1. /superpowers:using-git-worktrees → isolate the work on a branch
2. /superpowers:brainstorming → explore intent, produce design doc
(or use Perplexity/Context7 for deeper research)
3. /exe-build-e2e @design-doc.md → spec → AC → tests → implement → verify
4. /superpowers:verification-before-completion → final evidence-based check
5. /superpowers:finishing-a-development-branch → merge or PR
GSD is a full project management framework with roadmapping, phase planning, execution, and verification — designed for multi-milestone, multi-phase projects.
Where GSD excels:
- Project-level orchestration — roadmaps, milestones, multi-phase planning
- Rich agent ecosystem (15+ specialized agents: planner, executor, verifier, debugger, UI auditor, etc.)
- State management across sessions (STATE.md, checkpoints, resume-work)
- Goal-backward verification (checks if the goal was achieved, not just if tasks completed)
- UI-specific workflows (UI-SPEC.md, 6-pillar visual audit)
- Built for long-running projects with multiple developers
Where exe-build differs:
| Dimension | GSD | exe-build |
|---|---|---|
| Scope | Full project lifecycle (roadmap → milestone → phase → task) | Single feature (idea → verified implementation) |
| Planning | Discussion → research → plan → execute → verify per phase | Spec → AC → tests → implement → verify per feature |
| Requirements | Natural language in CONTEXT.md with locked/deferred decisions | EARS-format atomic requirements with metadata in requirements.md |
| Verification | Goal-backward (did we achieve the goal?) | Gate-forward (did each artifact pass validation?) + code evaluation |
| Test approach | Opt-in TDD per task (tdd="true" in plan) + Nyquist auditor fills coverage gaps post-implementation | TDD-native by default: all tests generated from AC before any implementation begins |
| Fix approach | Debugger agent with scientific method | 8 domain specialists dispatched by defect classification |
| State persistence | STATE.md, SUMMARY.md, checkpoints across sessions | Artifacts in docs/features/ (stateless pipeline, re-runnable) |
| Agent count | 15+ specialized agents | 9 skills + 8 sub-agents |
| Token cost | High (many agent spawns per phase) | High (deep per-feature pipeline) |
The scope trade-off is the key difference. GSD manages the project — what gets built when, in what order, across how many phases. exe-build manages the feature — given a feature to build, produce a verified implementation. GSD asks "what should Phase 3 of Milestone 2 deliver?" exe-build asks "does this implementation satisfy all 7 requirements for the auth feature?" These are different questions at different levels of abstraction, which is why the systems compose well rather than compete.
The verification trade-off matters. GSD's goal-backward verification is powerful — it starts from the desired outcome and checks whether the codebase actually delivers it, regardless of whether tasks were marked complete. This catches the "placeholder problem" where a task is technically done (file created) but the goal isn't achieved (component is a stub). exe-build's gate-forward validation catches a different class of problems — coverage gaps where a requirement has no AC, an AC has no test, or a test has no task. Goal-backward asks "does it work?" Gate-forward asks "is everything covered?" You want both.
The testing trade-off. GSD supports TDD per task (opt-in via tdd="true" in the plan), and its Nyquist auditor retroactively fills test coverage gaps after implementation. exe-build generates all tests from acceptance criteria before implementation begins — every test exists before a line of code is written. GSD's approach is more flexible (you choose when to apply TDD), exe-build's is more rigorous (tests are structurally independent of implementation). GSD's Nyquist auditor is a safety net that catches what was missed; exe-build's evaluation criteria (E1–E5) are a gate that blocks progress until coverage is complete.
The state persistence trade-off. GSD maintains rich state across sessions — STATE.md tracks position, SUMMARY.md records what happened, checkpoints allow pausing and resuming across context resets. This makes GSD excellent for multi-day, multi-session projects. exe-build is stateless by design — all state lives in the artifact files (docs/features/*/). You can re-run any phase at any time by pointing it at the existing artifacts. The trade-off: GSD handles interruptions gracefully, exe-build handles re-runs gracefully.
The fix approach trade-off. GSD's debugger agent uses the scientific method — hypothesis, experiment, conclusion. It's a generalist that can investigate any bug through systematic reasoning. exe-build's fix-dispatcher routes defects to 8 domain specialists, each with a detailed playbook. The debugger is better for novel, unexpected bugs where you don't know the category. The specialists are better for known defect classes (auth, concurrency, data integrity) where domain-specific patterns and anti-patterns matter more than general investigation.
GSD is stronger when:
- You're managing a multi-milestone project that spans weeks or months
- Work happens across multiple sessions and needs persistent state
- Phases have diverse types of work (UI, backend, infrastructure) that benefit from specialized agents (UI auditor, codebase mapper, integration checker)
- You need a research phase before planning — GSD has dedicated researcher and synthesizer agents
- The project involves UI work — GSD has UI-SPEC.md, 6-pillar visual audit, and UI-specific quality checks
exe-build is stronger when:
- You need deep, feature-level spec-to-test coverage with full traceability
- Requirements must be precise and testable (EARS eliminates ambiguity)
- You need mechanical proof that every requirement has acceptance criteria and every AC has tests (E1–E5 gates)
- Defects need domain-expert fixes rather than general debugging
- The feature is self-contained and can be specified, tested, and verified in one pipeline run
The honest assessment: GSD is the better project management system — it handles the complexity of coordinating multi-phase work across sessions. exe-build is the better feature verification system — it ensures that any individual feature is thoroughly specified, tested, and traceable. They operate at different levels and don't conflict.
GSD and exe-build compose naturally because they operate at different levels — project vs feature.
Workflow 1: GSD roadmap → exe-build per feature
Use GSD to create the project roadmap, break it into milestones and phases, and manage cross-phase dependencies. When a phase requires building a feature with rigorous spec-to-test coverage, GSD's executor can delegate to exe-build's pipeline. The exe-build artifacts (requirements.md, ac.json, tests/) become the phase deliverables that GSD's verifier checks.
Workflow 2: exe-build specs → GSD execution
Run exe-build's spec phase (spec-writer → ac-writer → test-writer) to generate validated requirements, AC, and tests. Then use GSD's executor to implement the tasks.md with its deviation handling, atomic commits, and checkpoint protocols. GSD's state persistence means you can pause mid-implementation and resume in a new session.
Workflow 3: Use GSD agents that exe-build lacks GSD has agents for areas exe-build doesn't cover:
| GSD agent | What it does | When to use with exe-build |
|---|---|---|
| gsd-debugger | Scientific-method bug investigation with persistent debug sessions | When fix-dispatcher exhausts 3 iterations: switch to systematic root cause investigation |
| gsd-verifier | Goal-backward verification (does the code achieve the goal, not just complete tasks?) | After the exe-build pipeline completes: verify the feature actually works end-to-end, not just that gates passed |
| gsd-ui-auditor | 6-pillar visual audit of frontend code | After exe-build builds a frontend feature: audit the visual quality |
| gsd-ui-researcher | Produces UI-SPEC.md design contracts | Before running exe-build on a frontend feature: define the visual spec |
| gsd-codebase-mapper | Parallel codebase analysis across tech, architecture, quality, concerns | Before starting exe-build on a large unfamiliar codebase: understand what you're working with |
| gsd-integration-checker | Verifies cross-phase integration and E2E user flows | After multiple exe-build pipeline runs: check that features connect properly |
A realistic combined session:
1. /gsd:new-project → roadmap with milestones and phases
2. /gsd:plan-phase → plan Phase 1 with research and task breakdown
3. /exe-build-e2e @phase-1-design.md → spec → AC → tests → implement → verify
4. /gsd:verify-work → goal-backward check: did Phase 1 deliver?
5. /gsd:plan-phase → plan Phase 2
6. /exe-build-e2e @phase-2-design.md → repeat for Phase 2
7. /gsd:audit-milestone → audit milestone completion before moving on
Standard TDD (Test-Driven Development) and SDD (Spec-Driven Development, e.g., open-spec) workflows follow a simpler pattern: write tests or specs by hand, implement code, verify. This is how most professional software gets built today — no AI pipeline, no token cost, just developers and their tools.
Where standard TDD/SDD excels:
- Minimal overhead — no pipeline, no gates, no artifact generation
- Human judgment drives the spec (not AI generation that might hallucinate requirements)
- No token cost beyond the implementation itself
- Well-understood by any developer — no tooling setup, no learning curve
- The developer who writes the spec has full domain context that no AI possesses
- Battle-tested across decades of real-world software projects
Where exe-build differs:
| Dimension | Standard TDD/SDD | exe-build |
|---|---|---|
| Who writes specs | Human developer with domain expertise | AI generates from idea/doc, human reviews once |
| Who writes tests | Human developer | AI generates from validated AC |
| Spec format | Varies (user stories, Gherkin, free text, internal docs) | EARS atomic requirements + machine-readable AC JSON |
| Traceability | Manual (developer keeps track, or uses JIRA/Linear links) | Automatic (REQ → AC → TEST → TASK with numbered IDs, bidirectional) |
| Validation | Human review + CI pipeline | 10+ automated gates with repair loops, then CI |
| Coverage gaps | Caught by code review, QA, or production incidents | Caught by evaluation criteria E1–E5 (mechanically, before implementation) |
| Fix routing | Developer debugs and fixes based on experience | Automated defect classification + specialist dispatch |
| Iteration speed | Depends on team size and developer speed | AI generates in minutes, but pipeline adds overhead |
| Token cost | Zero (human-driven) | 50K–500K+ tokens per feature depending on complexity |
| Wall-clock time | Hours to days (human writes everything) | Minutes to hours (AI generates, human reviews) |
The authorship trade-off. When a human writes specs and tests, every decision reflects their domain knowledge, their understanding of edge cases from years of experience, and their intuition about what matters. No AI can match a senior developer's domain expertise. But human-authored specs have blind spots — the developer doesn't write tests for the edge cases they don't think of. exe-build's mechanical coverage checks (E1–E5) catch a different class of gaps: the systematic ones. "Every requirement has at least one AC, every error case has a test, every state transition is covered" — these are checks that don't require domain expertise, just thoroughness. A human is better at knowing which edge cases matter most. exe-build is better at ensuring none are accidentally skipped.
The traceability trade-off. In manual TDD, traceability lives in the developer's head, in JIRA ticket links, or in commit messages. It's implicit and often incomplete. exe-build's numbered ID system (REQ-001 → AC-001 → TEST-001 → TASK-001) is explicit and machine-readable, but it's also rigid — it adds structural overhead that a small team doing TDD in a well-understood codebase doesn't need. Traceability matters most when: (a) requirements come from external stakeholders who need proof of coverage, (b) the codebase has compliance requirements, or (c) the team is large enough that implicit knowledge gets lost. For a solo developer on a personal project, it's overkill.
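To make the "explicit and machine-readable" point concrete, here is a minimal sketch of the kind of check a numbered ID chain allows. The `TraceMatrix` shape and the `find_gaps` helper are hypothetical illustrations, not the format exe-build actually writes to ac.json or evaluation.json.

```python
from dataclasses import dataclass, field


@dataclass
class TraceMatrix:
    """Hypothetical in-memory view of REQ -> AC -> TEST links."""
    req_to_acs: dict[str, list[str]] = field(default_factory=dict)
    ac_to_tests: dict[str, list[str]] = field(default_factory=dict)


def find_gaps(matrix: TraceMatrix) -> dict[str, list[str]]:
    """Report broken links in either direction of the chain."""
    # Forward direction: requirements that have no acceptance criteria.
    reqs_without_ac = sorted(
        req for req, acs in matrix.req_to_acs.items() if not acs
    )
    # Forward direction: linked ACs that have no test.
    linked_acs = {ac for acs in matrix.req_to_acs.values() for ac in acs}
    acs_without_test = sorted(
        ac for ac in linked_acs if not matrix.ac_to_tests.get(ac)
    )
    # Backward direction: ACs with tests that trace to no requirement.
    orphan_acs = sorted(set(matrix.ac_to_tests) - linked_acs)
    return {
        "requirements_without_ac": reqs_without_ac,
        "acs_without_test": acs_without_test,
        "orphan_acs": orphan_acs,
    }


example = TraceMatrix(
    req_to_acs={"REQ-001": ["AC-001"], "REQ-002": []},
    ac_to_tests={"AC-001": ["TEST-001"]},
)
print(find_gaps(example))
```

A check like this needs no domain knowledge, which is why it is cheap to automate and easy to forget by hand.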
The cost trade-off. Standard TDD costs zero tokens but costs developer hours. exe-build costs 50K–500K tokens but generates specs, tests, and traceability in minutes. The right comparison isn't "free vs expensive" — it's "developer time vs token spend." If developer time is cheap and domain knowledge is critical, manual TDD wins. If developer time is expensive and you need validated specs fast, exe-build wins.
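The "developer time vs token spend" comparison can be written down as simple arithmetic. This is only a sketch: the function names are hypothetical, it uses a single blended per-token price (real pricing usually differs for input and output tokens), and every number in the usage lines is a made-up placeholder, not a measured cost.

```python
def manual_cost(dev_hours: float, hourly_rate: float) -> float:
    """Cost of hand-written specs and tests: developer time only."""
    return dev_hours * hourly_rate


def pipeline_cost(
    tokens: int,
    price_per_million_tokens: float,
    review_hours: float,
    hourly_rate: float,
) -> float:
    """Cost of generated specs and tests: token spend plus human review time."""
    token_spend = tokens / 1_000_000 * price_per_million_tokens
    return token_spend + review_hours * hourly_rate


# Placeholder inputs for illustration only; substitute your own figures.
by_hand = manual_cost(dev_hours=4, hourly_rate=50)
generated = pipeline_cost(
    tokens=200_000, price_per_million_tokens=10.0,
    review_hours=1.0, hourly_rate=50,
)
print("manual:", by_hand, "pipeline:", generated)
```

Note that the review hours belong on the pipeline side of the ledger. Leaving them out makes generated specs look cheaper than they are.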
The trust trade-off. Human-authored specs are trusted by default — the developer understands what they wrote. AI-generated specs require review — the developer must verify that the AI understood the intent correctly. This review cost is real but often overlooked. exe-build mitigates it with the Initial Review checkpoint and EARS formatting (structured requirements are easier to verify than free-form prose), but it doesn't eliminate it.
Standard TDD/SDD is better when:
- The developer has deep domain expertise and knows the edge cases from experience
- The feature is well-understood and doesn't need formal specification
- Token cost is a concern and developer time is available
- The team has established spec/test patterns and doesn't need tooling to enforce them
- Requirements come from the developer's own understanding (not an external doc)
- The project is small enough that implicit traceability (commit messages, ticket links) is sufficient
exe-build is better when:
- Requirements come from a document (PRD, design doc, research) that needs to be transformed into testable specs
- The feature has many error paths, state transitions, or integration points that a human might not enumerate completely
- You need explicit, auditable traceability (compliance, external stakeholders, handoffs between teams)
- Developer time is more expensive than token cost
- You want coverage guarantees before implementation, not coverage discovery after production incidents
- The spec author and the implementer are different people (or different AI agents) — exe-build's separation of concerns prevents the implementer from writing tests that validate their own misunderstanding
The honest assessment: A senior developer doing disciplined TDD will produce better specs and more insightful tests than exe-build for features in their area of expertise. exe-build will produce more complete coverage (no gaps in requirement-to-test mapping) and do it faster in wall-clock time. The ideal workflow may be: use exe-build to generate the initial spec and tests, then have a senior developer review and refine them — combining mechanical completeness with human judgment.
exe-build and manual TDD are not mutually exclusive. The pipeline generates artifacts that fit naturally into a standard development workflow.
Workflow 1: exe-build specs → human TDD implementation
Run exe-build's spec phase (spec-writer → ac-writer → test-writer) to generate validated requirements, acceptance criteria, and test skeletons. Then hand the artifacts to a developer who implements using traditional TDD, starting from a complete spec, pre-written test cases, and a traceability matrix. The developer can modify, add, or remove tests based on their domain expertise.
Workflow 2: Human specs → exe-build test generation
Write requirements and design docs by hand (the way you always have). Then run just /test-writer to generate tests from your specs. exe-build's evaluation criteria (E1–E5) will mechanically check whether your hand-written specs have coverage gaps — missing error paths, untested state transitions, requirements without acceptance criteria. This uses exe-build as a coverage audit tool rather than a full pipeline.
Workflow 3: exe-build as CI gate
Integrate exe-build's evaluation criteria into your existing CI pipeline. After a developer writes specs and tests manually, run the evaluation phase to verify coverage completeness. This adds mechanical coverage checking to your existing human-driven workflow without replacing it.
A realistic combined session:
1. Developer writes design doc manually (or with Perplexity/research)
2. /spec-writer @design-doc.md → EARS requirements from the doc
3. Developer reviews requirements.md, corrects domain-specific issues
4. /ac-writer → scenarios + acceptance criteria
5. /test-writer → generated tests + coverage evaluation
6. Developer reviews tests, adds domain-specific edge cases
7. Developer implements using traditional TDD (tests are already written)
8. /test-runner → verify all tests pass
These are approximate costs for a Tier 2 feature (3–8 requirements) using Claude Opus 4.6:
| Phase | Input tokens | Output tokens | Notes |
|---|---|---|---|
| spec-writer | 15K–30K | 8K–15K | Codebase scan + 3 artifacts + 3 gates |
| ac-writer | 10K–20K | 5K–10K | Reads specs, generates scenarios + AC + 2 gates |
| test-writer | 15K–30K | 10K–20K | Evaluation + test generation + backfill + 3 gates |
| Implementation | 20K–50K | 15K–40K | Reads tasks + tests, writes code |
| test-runner | 5K–10K | 2K–5K | Executes tests, collects results |
| code-evaluator | 15K–25K | 8K–15K | Failure mapping + classification + recommendations |
| fix-dispatcher | 10K–20K per specialist | 5K–15K per specialist | Only if fixes needed |
| Repair loops | 10K–20K per attempt | 5K–10K per attempt | Only if gates fail |
Typical total for a Tier 2 feature: 100K–250K tokens (no repair loops) to 300K–500K tokens (with repair loops and fix iterations).
Tier 1 (abbreviated): 30K–80K tokens. Skips design.md, ac-writer, and code-evaluator.
Tier 3 (extended): 300K–800K+ tokens. Adds architecture review, property-based tests, and code review.
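As a sanity check on the totals, the per-phase ranges in the table can be summed directly. The sketch below copies the table's figures (in thousands of tokens); the helper name and the choice of two specialists and two repair attempts in the usage line are illustrative assumptions. The base rows sum to roughly 128K–270K, close to but not identical with the rounded 100K–250K quoted above.

```python
# (input_low, input_high, output_low, output_high), in thousands of tokens,
# copied from the per-phase table for a Tier 2 feature.
PHASES_K = {
    "spec-writer": (15, 30, 8, 15),
    "ac-writer": (10, 20, 5, 10),
    "test-writer": (15, 30, 10, 20),
    "implementation": (20, 50, 15, 40),
    "test-runner": (5, 10, 2, 5),
    "code-evaluator": (15, 25, 8, 15),
}
FIX_SPECIALIST_K = (10, 20, 5, 15)   # per specialist dispatched
REPAIR_ATTEMPT_K = (10, 20, 5, 10)   # per gate repair attempt


def total_range_k(specialists: int = 0, repair_attempts: int = 0) -> tuple[int, int]:
    """Return (low, high) total tokens in thousands for one pipeline run."""
    rows = list(PHASES_K.values())
    rows += [FIX_SPECIALIST_K] * specialists
    rows += [REPAIR_ATTEMPT_K] * repair_attempts
    low = sum(in_lo + out_lo for in_lo, _, out_lo, _ in rows)
    high = sum(in_hi + out_hi for _, in_hi, _, out_hi in rows)
    return low, high


print(total_range_k())        # clean run, no fixes or repairs
print(total_range_k(2, 2))    # two specialists and two repair attempts
```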
The alternative is not "zero tokens" — it's "the same tokens spent on rework." Without the pipeline:
- AI writes code from a vague description (~50K tokens)
- Tests fail or are wrong (~20K tokens to debug)
- Human discovers a missed requirement (~50K tokens to rework)
- More tests fail (~20K tokens)
- Repeat until frustrated
Total: 150K–300K tokens with lower quality and no traceability. The pipeline front-loads the cost into structured spec work that prevents rework.
Recommended: Claude Opus 4.6 (1M context)
The pipeline was designed for and tested with Claude Opus 4.6. Here's why:
- Context window: A full Tier 2 pipeline generates 80K–150K tokens of artifacts. Inline execution (our default) needs the full pipeline context in one conversation. Opus 4.6's 1M context handles this easily. Smaller-context models (32K–128K) would force agent splitting, at the cost of context and quality.
- Instruction following: The skills contain precise, multi-step instructions with conditional logic ("if E1–E4 fails, enter repair loop"). Opus 4.6 follows these reliably. Smaller models skip steps or invent shortcuts.
- EARS generation: Producing correct EARS-format requirements with proper metadata fields requires understanding the pattern system. Opus 4.6 generates valid EARS consistently. Sonnet 4.6 works for simpler features but occasionally drops metadata fields.
Will it work with other models?
| Model | Viability | Notes |
|---|---|---|
| Claude Opus 4.6 | Full support | Designed for this model |
| Claude Sonnet 4.6 | Good for Tier 1–2 | May need agent splitting for Tier 3; occasional gate issues |
| Claude Haiku 4.5 | Tier 1 only | Struggles with multi-gate validation and complex EARS patterns |
| GPT-4o / o1 | Untested | Skills are model-agnostic Markdown; should work with prompt adaptation |
| Other agents | Untested | Any agent that reads .md skill files can use these |
If token cost is a concern:
- Use Tier 1 for small features. Override the auto-classifier: "Make it Tier 1." This skips design.md, ac-writer, and code-evaluator.
- Run individual skills. Use /spec-writer alone to generate specs, then implement manually.
- Skip the fix loop. If your implementation is close, run test-runner and fix manually instead of letting fix-dispatcher iterate.
- Use Sonnet 4.6 for Tier 1. The abbreviated pipeline is simple enough for Sonnet.
These skills are plain Markdown files. There is no:
- Compiled code or binaries
- Network calls or telemetry
- External service dependencies
- Encrypted or obfuscated content
- Dynamic code generation beyond what your AI agent does natively
- Package dependencies or install scripts
Every instruction the AI receives is readable text. If a skill tells the AI to do something, you can see exactly what it says. The entire system is ~14,500 lines of Markdown across 9 skill files and 8 specialist sub-agent playbooks.
With Claude Code (recommended — audits everything at once):
claude -p "Read every SKILL.md file in .claude/skills/ and every .md file in \
.claude/subagents/. For each file, report: (1) what it instructs the AI to do, \
(2) what files it reads, (3) what files it writes or modifies, (4) any shell \
commands it executes, (5) any network or external service access, (6) anything \
that could be a security concern. Summarize with a risk assessment."
With any other AI agent (Cursor, Windsurf, Codex, etc.):
Point the agent at the skill directories and ask the same question. The files are standard Markdown — any tool that can read text can audit them.
With grep (quick automated scan for risky patterns):
# Check for network/fetch/curl/API instructions
grep -ri "curl\|wget\|fetch(\|http://\|https://\|api\..*\.com" .claude/skills/ .claude/subagents/
# Check for file deletion instructions
grep -ri "rm \|rm -\|delete\|rmdir\|unlink\|fs\.remove" .claude/skills/ .claude/subagents/
# Check for shell execution beyond test running
grep -ri "child_process\|subprocess\|os\.system\|exec(\|spawn(" .claude/skills/ .claude/subagents/
# Check for credential or secret handling
grep -ri "api.key\|api_key\|secret\|credential\|\.env" .claude/skills/ .claude/subagents/
# Check for base64 or encoding (could hide content)
grep -ri "base64\|btoa\|atob\|encode(" .claude/skills/ .claude/subagents/
If any of these return results, read the context: some keywords appear in the AC generation examples (like "password" in an auth feature example) but do not instruct the AI to handle credentials.
Diffing against upstream (tamper detection):
# After cloning, verify no modifications
git diff origin/main -- .claude/skills/ .claude/subagents/
| Skill | Reads | Writes | Executes |
|---|---|---|---|
| spec-writer | Your codebase structure (to detect stack) | docs/features/*/requirements.md, design.md, tasks.md | Nothing |
| ac-writer | Requirements and design files | scenarios.md, ac.json | Nothing |
| test-writer | AC files, your codebase | tests/*.test.*, evaluation.json, testids.json | Nothing |
| test-runner | Test files | test-results.json | Your test command (npm test, pytest, etc.) |
| code-evaluator | Test results, spec files | code-evaluation.json, code-report.md | Nothing |
| fix-dispatcher | Evaluation results, spec files | Your source code (targeted fixes) | Nothing |
| code-reviewer | Your code, spec files | code-review.md | Nothing |
| ac-fix | Evaluation feedback, AC files | ac-fixed.json | Nothing |
| exe-build-e2e | All of the above (orchestrator) | All of the above | Delegates to test-runner |
The only skill that executes shell commands is test-runner, and it runs your project's existing test command (the same command you'd run manually).
fix-dispatcher modifies your source code, but only the specific files identified by code-evaluator's failure analysis. It does not modify files outside the scope of the failing tests.
These skills operate within your AI agent's existing permission model:
- Claude Code: If you use --allowedTools or permission prompts, skills cannot bypass them. The skills are instructions; your agent enforces the actual permissions.
- Other agents: Skills operate within the agent's built-in approval flow. File writes require your approval unless auto-approved.
- Any agent: The skills never circumvent the agent's built-in permission system. They are text instructions, not code that runs independently.
Each skill is a Markdown file with YAML frontmatter:
---
name: spec-writer
version: v3
description: When to use this skill
---
# Skill Name
## Overview
What the skill does.
## When to Use
Trigger conditions.
## Pipeline
Step-by-step instructions with validation gates.
## Gates
Validation criteria that must pass before proceeding.
The AI agent reads these files as instructions. The frontmatter enables discovery and routing. The Markdown body is the actual behavior specification.
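If you want to inspect skill metadata yourself, the frontmatter is easy to separate from the body. The sketch below is a dependency-free illustration that only handles flat `key: value` lines like the example above; `parse_frontmatter` is a hypothetical helper, not part of the repo, and a real YAML parser would be needed for nested or multi-line values.

```python
def parse_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into (frontmatter fields, Markdown body)."""
    lines = text.splitlines()
    # A file without an opening '---' has no frontmatter.
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing '---'; treat a missing one as "no frontmatter".
    end = next(
        (i for i in range(1, len(lines)) if lines[i].strip() == "---"), None
    )
    if end is None:
        return {}, text
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return meta, body


SAMPLE = """---
name: spec-writer
version: v3
description: When to use this skill
---
# Skill Name
## Overview
What the skill does.
"""

meta, body = parse_frontmatter(SAMPLE)
print(meta["name"], meta["version"])
```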
The exe-build-e2e skill orchestrates the others in sequence:
spec-writer → ac-writer → test-writer → [implementation] → test-runner → code-evaluator → fix-dispatcher
│ │
▼ ▼
code-reviewer 8 specialist
[Tier 3] sub-agents
Each skill:
- Declares its inputs (which files must exist)
- Validates inputs before proceeding
- Generates outputs with deterministic naming
- Runs validation gates on its outputs
- Auto-repairs gate failures (up to 3 attempts)
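The generate, gate, repair cycle described in this list can be sketched as a small loop. In the pipeline the steps are Markdown instructions the agent follows, not Python; `run_with_repair`, `GateFailure`, and the callable signatures below are hypothetical names used only to show the control flow, including the hard fail after the third repair attempt.

```python
from typing import Callable


class GateFailure(Exception):
    """Raised when an artifact still fails its gate after all repair attempts."""


def run_with_repair(
    generate: Callable[[], str],
    gate: Callable[[str], list[str]],
    repair: Callable[[str, list[str]], str],
    max_repairs: int = 3,
) -> str:
    """Generate an artifact, validate it, and repair it up to max_repairs times."""
    artifact = generate()
    failures = gate(artifact)  # an empty list means the gate passed
    attempts = 0
    while failures and attempts < max_repairs:
        artifact = repair(artifact, failures)
        failures = gate(artifact)
        attempts += 1
    if failures:
        # Hard fail: report the failing criteria instead of continuing.
        raise GateFailure(f"still failing after {attempts} repairs: {failures}")
    return artifact


# Toy usage: the gate requires the word "shall" in the artifact.
result = run_with_repair(
    generate=lambda: "The system encrypts passwords",
    gate=lambda text: [] if "shall" in text else ["missing 'shall'"],
    repair=lambda text, failures: text.replace("encrypts", "shall encrypt"),
)
print(result)
```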
Requirements use the EARS (Easy Approach to Requirements Syntax) patterns:
| Pattern | Template | Example |
|---|---|---|
| Ubiquitous | The system shall [action] | The system shall encrypt passwords at rest |
| Event-driven | When [event], the system shall [action] | When session expires, the system shall redirect to login |
| State-driven | While [state], the system shall [action] | While offline, the system shall queue mutations |
| Optional | Where [condition], the system shall [action] | Where MFA is enabled, the system shall require a second factor |
| Complex | While [state], when [event], the system shall [action] | While rate-limited, when new request arrives, the system shall return 429 |
This eliminates ambiguous requirements like "the system should handle authentication" and forces precise, testable statements.
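Because the five templates are so regular, a rough pattern check is easy to write. The sketch below maps a sentence to one of the table's pattern names using regular expressions. It is an illustration of the templates only, not the validation exe-build's gates perform, and `classify_ears` is a hypothetical helper.

```python
import re
from typing import Optional

# Order matters: "Complex" must be tried before "State-driven",
# since both start with "While".
EARS_PATTERNS = [
    ("Complex", re.compile(r"^While .+?, when .+?, the system shall .+", re.IGNORECASE)),
    ("Event-driven", re.compile(r"^When .+?, the system shall .+", re.IGNORECASE)),
    ("State-driven", re.compile(r"^While .+?, the system shall .+", re.IGNORECASE)),
    ("Optional", re.compile(r"^Where .+?, the system shall .+", re.IGNORECASE)),
    ("Ubiquitous", re.compile(r"^The system shall .+", re.IGNORECASE)),
]


def classify_ears(requirement: str) -> Optional[str]:
    """Return the EARS pattern name, or None if the sentence fits no template."""
    text = requirement.strip()
    for name, pattern in EARS_PATTERNS:
        if pattern.match(text):
            return name
    return None


print(classify_ears("When session expires, the system shall redirect to login"))
print(classify_ears("The system should handle authentication"))
```

The second call returns `None`: "should handle" fits no template, which is exactly the kind of sentence the format is meant to reject.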
| Tier | Requirements | Pipeline | What's included | What's skipped |
|---|---|---|---|---|
| Tier 1 | < 3 | Abbreviated | requirements.md, tests, test-runner | design.md, ac-writer, code-evaluator |
| Tier 2 | 3–8 | Standard | All phases, all gates, fix loop | Property-based tests, code review |
| Tier 3 | > 8 or multi-service | Extended | Everything + architecture review, property-based tests, code review | Nothing |
The pipeline auto-classifies based on requirement count (and multi-service scope for Tier 3), but you can override it: "Make it Tier 1."
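The thresholds in the tier table reduce to a few comparisons. This sketch is an assumption about how the rule could be expressed, not the classifier's actual logic; in particular, it assumes a multi-service feature is Tier 3 regardless of requirement count, and `classify_tier` is a hypothetical name.

```python
from typing import Optional


def classify_tier(
    requirement_count: int,
    multi_service: bool = False,
    override: Optional[int] = None,
) -> int:
    """Pick a pipeline tier from the table's thresholds, honoring an explicit override."""
    if override is not None:
        if override not in (1, 2, 3):
            raise ValueError("override must be 1, 2, or 3")
        return override
    if multi_service or requirement_count > 8:
        return 3  # extended pipeline
    if requirement_count < 3:
        return 1  # abbreviated pipeline
    return 2      # standard pipeline (3 to 8 requirements)


print(classify_tier(5))               # standard
print(classify_tier(12, override=1))  # "Make it Tier 1."
```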
When code-evaluator identifies defects, fix-dispatcher routes each to the right expert:
| Sub-agent | Domain | Playbook size |
|---|---|---|
| validation-fix | Input validation, schema enforcement | ~264 lines |
| auth-fix | Authentication, authorization, sessions | ~346 lines |
| state-fix | State machines, race conditions, lifecycle | ~314 lines |
| error-handling-fix | Error propagation, recovery, boundaries | ~385 lines |
| concurrency-fix | Threading, async, deadlocks | ~370 lines |
| data-fix | Data transforms, serialization, integrity | ~386 lines |
| integration-fix | API contracts, protocols, versioning | ~416 lines |
| security-fix | Injection, XSS, CSRF, cryptography | ~345 lines |
Each playbook contains domain-specific patterns, anti-patterns, verification strategies, and example fixes.
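Routing by defect class amounts to a lookup from category to sub-agent. The sketch below uses the eight sub-agent names from the table. The category keys and the defect record shape are hypothetical, since the labels code-evaluator actually emits are not shown here.

```python
from collections import defaultdict

# Sub-agent names come from the table; the category keys are illustrative guesses.
SPECIALISTS = {
    "validation": "validation-fix",
    "auth": "auth-fix",
    "state": "state-fix",
    "error-handling": "error-handling-fix",
    "concurrency": "concurrency-fix",
    "data": "data-fix",
    "integration": "integration-fix",
    "security": "security-fix",
}


def route_defects(defects: list[dict]) -> tuple[dict[str, list[str]], list[str]]:
    """Group defect IDs by specialist; collect IDs whose category has no specialist."""
    plan: dict[str, list[str]] = defaultdict(list)
    unrouted: list[str] = []
    for defect in defects:
        agent = SPECIALISTS.get(defect["category"])
        if agent is None:
            unrouted.append(defect["id"])
        else:
            plan[agent].append(defect["id"])
    return dict(plan), unrouted


plan, unrouted = route_defects([
    {"id": "D1", "category": "auth"},
    {"id": "D2", "category": "security"},
    {"id": "D3", "category": "auth"},
])
print(plan, unrouted)
```

Grouping by specialist first means each playbook is loaded once per iteration rather than once per defect.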
exe-build-skills/
├── .claude/
│ ├── skills/ # Skills for Claude Code (9 skills)
│ │ ├── spec-writer/ # EARS requirements + design + tasks
│ │ ├── ac-writer/ # Scenarios + acceptance criteria
│ │ ├── test-writer/ # Test generation + evaluation
│ │ ├── test-runner/ # Test execution + results
│ │ ├── code-evaluator/ # Failure analysis + defect classification
│ │ ├── code-reviewer/ # Spec compliance review
│ │ ├── fix-dispatcher/ # Specialist routing + fix orchestration
│ │ ├── ac-fix/ # AC repair from evaluation feedback
│ │ └── exe-build-e2e/ # Pipeline orchestrator
│ ├── subagents/ # 8 specialist fix sub-agent playbooks
│ └── orchestrator/ # Node.js pipeline runner (optional)
├── .opencode/
│ └── skills/ # Same skills for OpenCode
├── ac-methods/ # 12 AC generation methodologies
├── docs/
│ └── features/ # Example generated specifications
└── README.md
| Skill | Description | Status |
|---|---|---|
| code-review (standalone) | Git-diff-based code review that works on any code without requiring spec files. Review any commit range, PR, or working tree changes. No pipeline dependencies. | Planned |
| security-review | Dedicated security audit — OWASP top-10, dependency vulnerability scan, secrets detection, auth/authz review, input validation coverage. Standalone or pipeline-integrated. | Planned |
| code-refactor | Structured refactoring with safety nets — identify refactor targets, generate characterization tests before changing code, validate behavior preservation after refactor, track what moved where. | Planned |
| Plugin marketplace | Install exe-build via /plugin install instead of copying files. | Planned |
These skills will work both standalone (run on any codebase, no spec files needed) and as pipeline extensions (feed findings into code-evaluator and fix-dispatcher).
- Fork the repo
- Edit or add skills in .claude/skills/
- Follow the existing skill format (YAML frontmatter + structured Markdown)
- Test your changes by running the pipeline against a real feature
- Submit a PR
Skills should be:
- Self-contained: No external dependencies
- Deterministic: Same input produces predictable output structure
- Gated: Include validation criteria
- Auditable: Plain Markdown, no obfuscation
MIT
Q: Do I need Claude Code specifically?
A: No. The repo includes skills for both Claude Code (.claude/skills/) and OpenCode (.opencode/skills/). Because the skills are plain .md instruction files, they should also work with Cursor, Windsurf, or any agent that reads such files from your repo, though those agents are untested.
Q: Do these skills send data anywhere?
A: No. The skills are local instruction files. They don't make network calls, phone home, or transmit anything. Your AI agent processes them locally.
Q: Can I use individual skills without the full pipeline?
A: Yes. Each skill works standalone. Use /spec-writer to just generate specs, /test-writer to just generate tests, etc.
Q: What languages are supported?
A: The skills auto-detect your project's language and framework. They've been tested with TypeScript, JavaScript, Python, Go, and Rust. The test generation adapts to your project's test framework.
Q: How do I verify the skills haven't been tampered with?
A: grep them. cat them. diff them against upstream. Ask any AI to audit them. They're plain text — every instruction is visible. See the Security and Auditability section for specific commands.
Q: What if a gate fails after 3 repair attempts?
A: The pipeline hard-fails and reports the specific failing criteria with context. It doesn't silently skip validation or push through broken artifacts. You can fix the issue manually and resume from that phase.
Q: How does this work with Superpowers or GSD?
A: They complement each other. Use GSD for project-level planning (what features to build, in what order), exe-build for feature-level spec-to-code (turning each feature into verified implementation), and Superpowers for task-level execution (subagent-per-task with reviews). See the comparison section for details.
Q: Is the token cost justified?
A: For features with 3+ requirements, yes. The pipeline front-loads cost into structured spec work that prevents rework. Without it, you spend similar tokens on debugging and re-implementation, but without traceability or coverage guarantees. For trivial features (< 3 requirements), use Tier 1 or individual skills to reduce cost. See Token Usage for detailed breakdowns.
Q: Can I use a cheaper model?
A: Sonnet 4.6 works for Tier 1–2 features. Haiku works for Tier 1 only. The pipeline was designed for Opus 4.6 and its instruction-following reliability. See the model table for specifics.
If exe-build-skills saved you from a bad implementation or caught a requirement gap before production, consider starring the repo — it helps other developers find it.