exe-build-skills — Spec-to-Verified-Code Pipeline for AI Agents

exe-build-skills is an open-source Spec-Driven Development (SDD) + Test-Driven Development (TDD) pipeline that turns a feature idea or design document into production-ready, verified code with full requirement traceability. It runs as plain Markdown skill files inside Claude Code, OpenCode, or any AI coding agent — no dependencies, no API keys, no compiled code. Every instruction is auditable text.

The pipeline generates EARS requirements, acceptance criteria, executable tests, implementation, and automated verification with 10+ validation gates and 8 domain-specialist fix agents. Designed for Claude Opus 4.6's 1M context window. (Updated March 18, 2026)

Recommended workflow: Before running the pipeline, use Perplexity or the latest frontier models to research what you want to build. Produce a design document or PRD in Markdown, then feed it into exe-build-e2e to activate the full pipeline. You can install the Perplexity MCP server and Context7 MCP server into Claude Code to research and generate the design doc without leaving the terminal.

         "Add user authentication with OAuth"
                          │
                          ▼
         ┌─ spec-writer ──────────────────────────────────────┐
         │  requirements.md → design.md → tasks.md            │
         │  (EARS methodology, 3 validation gates)            │
         └────────────────────────────────────────────────────┘
                          │
                          ▼
         ┌─ ac-writer ────────────────────────────────────────┐
         │  scenarios.md → ac.json                            │
         │  (Gherkin scenarios, acceptance criteria)          │
         └────────────────────────────────────────────────────┘
                          │
                          ▼
         ┌─ test-writer ──────────────────────────────────────┐
         │  evaluation.json → tests/*.test.ts                 │
         │  (executable tests in your project's framework)    │
         └────────────────────────────────────────────────────┘
                          │
                          ▼
         ┌─ Implementation ───────────────────────────────────┐
         │  TDD: write code to make the generated tests pass  │
         └────────────────────────────────────────────────────┘
                          │
                          ▼
         ┌─ Verification ─────────────────────────────────────┐
         │  test-runner → code-evaluator → fix-dispatcher     │
         │  (8 specialist sub-agents, up to 3 fix iterations) │
         └────────────────────────────────────────────────────┘

Quick Start

# 1. Clone exe-build-skills
git clone https://github.com/AskExe/exe-build-skills.git

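# 2. Copy skills + sub-agents into your project
#    (source paths have no trailing slash, so GNU and BSD cp both copy the directory itself)
mkdir -p your-project/.claude
cp -r exe-build-skills/.claude/skills your-project/.claude/
cp -r exe-build-skills/.claude/subagents your-project/.claude/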

# 3. Open Claude Code in your project and run:
#    /exe-build-e2e

That's it. No npm install, no API keys, no config. The pipeline auto-detects your language, framework, and test runner. Feed it a feature idea or a design document and it generates EARS requirements, acceptance criteria, tests, implementation, and verification — fully autonomous after one confirmation checkpoint.

Installation

Claude Code (recommended)

Copy the .claude/skills/ and .claude/subagents/ directories into your project:

# Clone the repo
git clone https://github.com/AskExe/exe-build-skills.git

# Copy skills into your project
# (source paths have no trailing slash, so GNU and BSD cp both copy the directory itself)
mkdir -p your-project/.claude
cp -r exe-build-skills/.claude/skills your-project/.claude/

# Copy sub-agents (needed for fix-dispatcher)
cp -r exe-build-skills/.claude/subagents your-project/.claude/

Claude Code automatically discovers skills from .claude/skills/*/SKILL.md.

OpenCode

Copy the .opencode/skills/ directory into your project:

mkdir -p your-project/.opencode
cp -r exe-build-skills/.opencode/skills your-project/.opencode/

OpenCode discovers skills from .opencode/skills/*/SKILL.md with the same format.

Other AI Agents (Cursor, Windsurf, etc.)

The skills are standard Markdown files. Copy .claude/skills/ into whatever directory your agent reads for instructions.

What gets installed

your-project/
├── .claude/
│   ├── skills/                  # 9 pipeline skills
│   │   ├── spec-writer/SKILL.md
│   │   ├── ac-writer/SKILL.md
│   │   ├── test-writer/SKILL.md
│   │   ├── test-runner/SKILL.md
│   │   ├── code-evaluator/SKILL.md
│   │   ├── code-reviewer/SKILL.md
│   │   ├── fix-dispatcher/SKILL.md
│   │   ├── ac-fix/SKILL.md
│   │   └── exe-build-e2e/SKILL.md
│   └── subagents/               # 8 specialist fix sub-agents
│       ├── validation-fix.md
│       ├── auth-fix.md
│       ├── state-fix.md
│       ├── error-handling-fix.md
│       ├── concurrency-fix.md
│       ├── data-fix.md
│       ├── integration-fix.md
│       └── security-fix.md

No dependencies. No npm install. No API keys. Just Markdown files.

Usage

Full pipeline (recommended)

Tell Claude Code to run the full pipeline:

> Run the exe-build-e2e pipeline for: user authentication with OAuth

Or use the slash command:

> /exe-build-e2e

The pipeline will:

1. Ask for the feature name and classify complexity (Tier 1/2/3)
2. Present the classification for your review; you can override it ("Make it Tier 2")
3. After you confirm, run fully autonomously: spec → AC → tests → implement → verify → fix

Starting from a document

If you have a PRD, design doc, or research document:

> Run exe-build-e2e for this design doc: @docs/auth-redesign.md

The pipeline enters Parse Mode — it extracts requirements from your document instead of asking questions, fills gaps with reasonable assumptions (noted in requirements.md), and runs autonomously from there.

Individual skills

Run any skill standalone:

> /spec-writer          # Generate requirements, design, tasks from an idea
> /ac-writer            # Generate acceptance criteria from existing specs
> /test-writer          # Generate tests from acceptance criteria
> /test-runner          # Execute tests and collect results
> /code-evaluator       # Analyze test failures and categorize defects
> /fix-dispatcher       # Dispatch specialists to fix categorized defects
> /code-reviewer        # LLM-native code review against specs

Resuming or continuing the pipeline

> Continue the pipeline from Phase 2 for user-auth    # Resume from a specific phase
> Run Phase 4 only for user-auth                      # Just verify an implementation
> What's the pipeline status for user-auth?            # Check which artifacts exist

Overriding the complexity tier

> Run exe-build-e2e for adding a logout button, make it Tier 1

Tier 1 skips design.md, ac-writer, and code-evaluator — fast track for simple features.

Generated artifacts

All artifacts are written to docs/features/<feature-name>/:

docs/features/user-auth-oauth/
├── requirements.md        # EARS atomic requirements (REQ-001, REQ-002, ...)
├── design.md              # Architecture, state diagrams, API contracts
├── tasks.md               # Implementation plan with test hooks
├── scenarios.md           # Gherkin test scenarios
├── ac.json                # Machine-readable acceptance criteria
├── evaluation.json        # AC coverage evaluation
├── test-results.json      # Test execution results
├── code-evaluation.json   # Defect analysis
├── code-report.md         # Human-readable failure report
├── fix-report.json        # Summary of applied fixes
└── tests/
    └── *.test.ts          # Executable tests (language auto-detected)
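
Because every phase writes to a fixed location, a status check can be as simple as looking for files. The following is a minimal sketch of that idea, not the pipeline's own code: the artifact names come from the tree above, while the `artifact_status` helper and its return shape are assumptions made for illustration.

```python
from pathlib import Path

# File names taken from the artifact tree above.
EXPECTED_ARTIFACTS = [
    "requirements.md",
    "design.md",
    "tasks.md",
    "scenarios.md",
    "ac.json",
    "evaluation.json",
    "test-results.json",
    "code-evaluation.json",
    "code-report.md",
    "fix-report.json",
]


def artifact_status(feature_dir: Path) -> dict[str, bool]:
    """Map each expected artifact name to whether it exists on disk.

    Hypothetical helper: the real pipeline may check more than existence.
    """
    status = {name: (feature_dir / name).is_file() for name in EXPECTED_ARTIFACTS}
    tests_dir = feature_dir / "tests"
    # tests/ counts as present only if it holds at least one file.
    status["tests/"] = tests_dir.is_dir() and any(p.is_file() for p in tests_dir.iterdir())
    return status


# Usage: print one line per artifact for a feature directory.
for name, present in artifact_status(Path("docs/features/user-auth-oauth")).items():
    print(f"{'ok' if present else 'missing':8} {name}")
```

Note that a Tier 1 run would legitimately report design.md as missing, since that tier skips it.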

Why This Exists

AI coding agents write code that passes tests but misses requirements. exe-build-skills forces every artifact — requirements, acceptance criteria, tests, implementation — to trace back to the original spec, with validation gates that catch drift before it compounds.

The pipeline was built to solve four problems we hit repeatedly:

Problem	What happens without exe-build	How exe-build fixes it
Tests validate the wrong thing	AI writes tests for what it thinks the feature does, not what was asked	Tests are generated from validated acceptance criteria, not from the implementation
Requirements are vague	AI fills gaps with unchecked assumptions	EARS methodology forces every requirement into a precise, testable statement
Fixes are random	AI makes changes until tests pass instead of diagnosing root causes	8 domain-specialist agents fix defects by class (auth, concurrency, security, etc.)
Coverage is invisible	No way to verify all requirements have tests	Numbered REQ → AC → TEST → TASK chain with mechanical coverage checks (E1–E5 gates)

Design Decisions and Trade-offs

Every design choice in exe-build has a reason and a cost. This section documents both so you can decide if the trade-offs make sense for your workflow.

Decision: EARS requirements over free-form specs

What we chose: All requirements must follow EARS (Easy Approach to Requirements Syntax) patterns — Ubiquitous, Event-driven, State-driven, Optional, or Complex.

Why: Free-form requirements like "handle authentication" are untestable. EARS forces every requirement into a precise, testable statement: "When the user submits the login form, the system shall validate credentials against the auth service." This structure is what makes downstream automation possible — ac-writer can enumerate scenarios mechanically, test-writer can generate assertions directly from the pattern.
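
To make the "precise, testable statement" point concrete, here is a rough sketch of how a requirement sentence could be sorted into the five patterns named above by its leading keywords. The keyword mapping (When, While, Where, and a mandatory "shall") follows the commonly published EARS templates; the `classify_ears` function is hypothetical and is not how spec-writer is known to work.

```python
import re

# Keyword -> pattern name, per the usual EARS sentence templates.
_KEYWORDS = {
    "when": "Event-driven",
    "while": "State-driven",
    "where": "Optional",
}


def classify_ears(requirement: str) -> str:
    """Return an EARS pattern name, or 'Not EARS' if there is no 'shall' clause."""
    match = re.search(r"\bshall\b", requirement, flags=re.IGNORECASE)
    if match is None:
        return "Not EARS"
    # Only the text before "shall" carries the trigger/state/feature keywords.
    precondition = requirement[: match.start()].lower()
    found = [name for kw, name in _KEYWORDS.items()
             if re.search(rf"\b{kw}\b", precondition)]
    if not found:
        return "Ubiquitous"
    if len(found) > 1:
        return "Complex"  # more than one keyword combined in one requirement
    return found[0]


print(classify_ears("handle authentication"))
print(classify_ears(
    "When the user submits the login form, the system shall validate "
    "credentials against the auth service."
))
```

The free-form example is rejected outright, while the rewritten sentence lands in the event-driven bucket, which is the property downstream skills rely on.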

Trade-off: EARS adds friction to the spec phase. Simple features that "obviously" need two lines of description now get a full requirements.md with metadata fields. For Tier 1 features (fewer than 3 requirements), this overhead is noticeable. We mitigate this with the tiered pipeline: Tier 1 skips design.md and the separate ac-writer phase, cutting the spec phase roughly in half.

Alternative considered: Just generate tests from a natural-language description. This is faster but produces tests that validate the AI's interpretation of the description, not the user's intent. We chose correctness over speed.

Decision: Gated validation with auto-repair loops

What we chose: 10+ explicit validation gates across the pipeline. Each gate has specific pass/fail criteria. Failed gates trigger targeted repair (up to 3 attempts) before the pipeline hard-fails.

Why: Without gates, errors compound. A missing error path in requirements becomes a missing scenario in AC, becomes a missing test, becomes an unhandled edge case in production. Each downstream artifact amplifies the original gap. Gates catch problems at the source.
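
The check, repair, re-check cycle described above can be sketched as a small loop. This is an illustration under stated assumptions: the limit of 3 repair attempts comes from the text, but `run_gate`, `GateResult`, `GateHardFailure`, and the toy check and repair functions are hypothetical names, and in the real pipeline both steps are carried out by the AI agent rather than by Python callables.

```python
from dataclasses import dataclass
from typing import Callable

MAX_REPAIR_ATTEMPTS = 3  # "up to 3 attempts" before the pipeline hard-fails


class GateHardFailure(Exception):
    """Raised when an artifact still fails its gate after all repair attempts."""


@dataclass
class GateResult:
    passed: bool
    problems: list[str]


def run_gate(artifact: str,
             check: Callable[[str], GateResult],
             repair: Callable[[str, list[str]], str]) -> str:
    """Check an artifact; repair and re-check until it passes or attempts run out."""
    result = check(artifact)
    attempts = 0
    while not result.passed:
        if attempts == MAX_REPAIR_ATTEMPTS:
            raise GateHardFailure(f"still failing after {attempts} repairs: {result.problems}")
        artifact = repair(artifact, result.problems)  # targeted fix for the listed problems
        attempts += 1
        result = check(artifact)
    return artifact


# Toy gate: requirements must describe at least one error path.
def has_error_path(text: str) -> GateResult:
    ok = "error" in text.lower()
    return GateResult(ok, [] if ok else ["no error path described"])


def add_error_path(text: str, problems: list[str]) -> str:
    return text + "\nWhen credentials are invalid, the system shall show an error message."


print(run_gate("When the user logs in, the system shall create a session.",
               has_error_path, add_error_path))
```

Each pass through the loop re-reads the artifact, which is where the extra token cost of gating comes from.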

Trade-off: Gates add token cost. Each gate evaluation requires the AI to re-read artifacts and check criteria. Repair loops multiply this — a single failed gate can cost 3x the tokens of the original generation. On average, the full pipeline uses 2–4x more tokens than ungated generation.

Why we accept this cost: Fixing a bad spec after implementation is 10–50x more expensive than fixing it at the gate. The 2–4x overhead is a bargain compared to rework.

Decision: Specialist sub-agents over generic fix-it

What we chose: 8 domain-expert sub-agents (validation, auth, state, error-handling, concurrency, data, integration, security), each with a 300–450 line playbook.

Why: A generic "fix the failing test" prompt produces shallow fixes — adding a try/catch to suppress an error, hardcoding an expected value, or "fixing" the test to match wrong behavior. Domain specialists know the patterns for their area. The auth-fix specialist knows to check token expiration, session invalidation, and CSRF protection — not just make the auth test pass.
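
Routing by defect class amounts to a lookup from category to playbook. The sketch below uses the eight sub-agent file names from the install tree; the defect record shape (`id`, `class`), the `DEF-###` IDs, and the choice to raise on an unknown class are assumptions for illustration, since the README does not specify them.

```python
from collections import defaultdict

# Defect class -> specialist playbook (file names from .claude/subagents/).
SPECIALISTS = {
    "validation": "validation-fix.md",
    "auth": "auth-fix.md",
    "state": "state-fix.md",
    "error-handling": "error-handling-fix.md",
    "concurrency": "concurrency-fix.md",
    "data": "data-fix.md",
    "integration": "integration-fix.md",
    "security": "security-fix.md",
}


def route_defects(defects: list[dict]) -> dict[str, list[str]]:
    """Group defect IDs by the specialist playbook that should handle them."""
    routed: dict[str, list[str]] = defaultdict(list)
    for defect in defects:
        playbook = SPECIALISTS.get(defect["class"])
        if playbook is None:
            # Assumption: unknown classes are surfaced rather than guessed at.
            raise ValueError(f"no specialist for defect class {defect['class']!r}")
        routed[playbook].append(defect["id"])
    return dict(routed)


print(route_defects([
    {"id": "DEF-001", "class": "auth"},
    {"id": "DEF-002", "class": "concurrency"},
    {"id": "DEF-003", "class": "auth"},
]))
```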

Trade-off: 8 specialist playbooks add ~3,000 lines of Markdown to the repo. Each specialist dispatch is a separate agent invocation with its own context window cost. For simple defects, a generalist would fix it faster and cheaper.

Why we accept this cost: Simple defects are caught by the implementation phase itself. By the time fix-dispatcher runs, the remaining defects are the hard ones — the ones where generic prompting produces band-aid fixes that break something else.

Decision: Full traceability (REQ → AC → TEST → TASK)

What we chose: Every artifact has numbered IDs. Bidirectional links are maintained across all artifacts. Changes to one artifact propagate traceability updates to others.

Why: When a test fails, you need to know which requirement it validates, which acceptance criterion defined the expected behavior, and which task is responsible for the implementation. Without traceability, the AI guesses — and guesses wrong. With traceability, fix-dispatcher knows exactly which specialist to send and exactly which code to change.
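
The value of numbered links is that both questions, "what is uncovered?" and "which requirement does this failing test belong to?", become mechanical lookups. Here is a minimal sketch under assumed data shapes: the `REQ-###` and `TEST-###` ID styles appear in this README, while the `AC-###` and `TASK-###` formats, the `TraceLinks` structure, and both helper functions are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TraceLinks:
    """Forward links between numbered artifacts (illustrative in-memory shape)."""
    req_to_ac: dict[str, list[str]] = field(default_factory=dict)
    ac_to_test: dict[str, list[str]] = field(default_factory=dict)
    test_to_task: dict[str, list[str]] = field(default_factory=dict)


def coverage_gaps(links: TraceLinks) -> dict[str, list[str]]:
    """List every ID whose forward link is empty."""
    def uncovered(mapping: dict[str, list[str]]) -> list[str]:
        return sorted(key for key, targets in mapping.items() if not targets)

    return {
        "requirements_without_ac": uncovered(links.req_to_ac),
        "ac_without_test": uncovered(links.ac_to_test),
        "tests_without_task": uncovered(links.test_to_task),
    }


def requirements_for_test(links: TraceLinks, test_id: str) -> list[str]:
    """Walk the chain backwards from a failing test to the requirements it validates."""
    acs = [ac for ac, tests in links.ac_to_test.items() if test_id in tests]
    return sorted(req for req, linked in links.req_to_ac.items()
                  if any(ac in linked for ac in acs))


links = TraceLinks(
    req_to_ac={"REQ-001": ["AC-001"], "REQ-002": []},
    ac_to_test={"AC-001": ["TEST-001"]},
    test_to_task={"TEST-001": ["TASK-001"]},
)
print(coverage_gaps(links))
print(requirements_for_test(links, "TEST-001"))
```

In this example REQ-002 has no acceptance criterion, which is the kind of gap a coverage gate is meant to block on.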

Trade-off: ID management is a significant portion of the skill instructions. The backfill phase (Phase 3 of test-writer) exists solely to update tasks.md with TEST-### references. This is pure bookkeeping overhead.

Why we accept this cost: Traceability is what makes the verification loop work. Without it, code-evaluator can't map failures to root causes, and fix-dispatcher can't route defects to the right specialist. The bookkeeping enables everything downstream.

Decision: Autonomous execution after one checkpoint

What we chose: The pipeline has exactly one interactive checkpoint at the start (confirm feature scope, review tier classification). After that, it runs fully autonomously until completion or hard failure.

Why: AI agents lose coherence when they stop and start. Each pause breaks the execution flow, requires re-reading context, and introduces opportunities for scope drift. A single checkpoint at the start validates the user's intent, then the pipeline executes without interruption.

Trade-off: If the pipeline makes a wrong assumption during autonomous execution, it will build on that assumption until a gate catches it (or until it finishes). The user can't course-correct mid-run.

Why we accept this cost: Gates are the course-correction mechanism. A wrong assumption in requirements.md will be caught when ac-writer can't generate valid scenarios for it, or when test-writer's evaluation criteria fail. The gates are the checkpoints — they're just automated.

Decision: TDD-native (tests before implementation)

What we chose: The pipeline generates executable tests from acceptance criteria before any implementation code is written. Implementation means "make the generated tests pass."

Why: When tests are written after code, they test what the code does — not what it should do. AI agents are particularly prone to this: they write code, then write tests that validate exactly that code, including its bugs. By generating tests from AC before implementation, the tests are independent of the implementation and validate the spec, not the code.
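
A small sketch of what "tests independent of the implementation" means in practice: the tests below are built only from acceptance-criteria records, fail against an empty stub, and pass once a matching implementation exists. Everything here is assumed for illustration: the real ac.json schema is not shown in this README, and the `login` functions, credentials, and helper names are hypothetical.

```python
import unittest

# Hypothetical acceptance criteria; the real ac.json schema may differ.
ACCEPTANCE_CRITERIA = [
    {"id": "AC-001", "given": {"username": "ada", "password": "s3cret"}, "then": True},
    {"id": "AC-002", "given": {"username": "ada", "password": "wrong"}, "then": False},
]


def build_suite(criteria, login):
    """Build one test per criterion without reading login()'s source."""
    suite = unittest.TestSuite()
    for ac in criteria:
        def check(ac=ac):  # bind the current criterion to this test
            assert login(**ac["given"]) is ac["then"], ac["id"]
        suite.addTest(unittest.FunctionTestCase(check, description=ac["id"]))
    return suite


def run(suite):
    result = unittest.TestResult()
    suite.run(result)
    return result


def login_stub(username, password):
    raise NotImplementedError  # nothing implemented yet, so every test must fail


def login(username, password):
    return (username, password) == ("ada", "s3cret")  # just enough to satisfy the AC


print("before implementation:", run(build_suite(ACCEPTANCE_CRITERIA, login_stub)).wasSuccessful())
print("after implementation: ", run(build_suite(ACCEPTANCE_CRITERIA, login)).wasSuccessful())
```

Because the suite is derived from the criteria alone, a bug in `login` cannot leak into what the tests expect.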

Trade-off: Generated tests sometimes require test infrastructure that doesn't exist yet (mocks, fixtures, test databases). The implementation phase has to set up this infrastructure as part of making tests pass, which can be awkward.

Decision: Inline execution by default, agents only for fix specialists

What we chose: The pipeline runs inline (same conversation context) by default. Only fix-dispatcher uses separate agents, and only because specialists benefit from isolated context.

Why: Every agent dispatch loses context. The parent must summarize what happened, the child must re-read artifacts. Running inline preserves the full context of what was generated and why. Fix specialists are the exception because they need deep domain context from their playbooks, and their changes are isolated enough to work without full pipeline context.

Trade-off: Long pipelines can approach context window limits. For Tier 3 features, inline execution can consume 300K–500K tokens in a single conversation. The skill includes a pressure valve: if context exceeds ~800K tokens, remaining phases are split to agents.
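
The pressure valve can be pictured as a check before each phase. The sketch below is only an approximation of that rule: the ~800K threshold comes from the text, while the phase names, per-phase token estimates, and the `execution_plan` function are hypothetical, and the skill itself expresses this as instructions to the agent rather than as code.

```python
CONTEXT_SPLIT_THRESHOLD = 800_000  # "~800K tokens" per the text; the trigger is approximate


def execution_plan(phases: list[str],
                   est_tokens: dict[str, int],
                   tokens_used: int = 0) -> list[tuple[str, str]]:
    """Run phases inline until context exceeds the threshold, then hand the rest to agents."""
    plan = []
    for phase in phases:
        if tokens_used > CONTEXT_SPLIT_THRESHOLD:
            # Inline context no longer grows, so all remaining phases go to agents.
            plan.append((phase, "agent"))
        else:
            plan.append((phase, "inline"))
            tokens_used += est_tokens[phase]
    return plan


# Made-up estimates for a large Tier 3 feature.
estimates = {"spec": 150_000, "ac": 100_000, "tests": 200_000,
             "implement": 400_000, "verify": 100_000}
for phase, mode in execution_plan(list(estimates), estimates):
    print(f"{phase:10} {mode}")
```

With these made-up numbers the first four phases stay inline and only verification is handed off, which keeps the summarization loss confined to the last step.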

Decision: Codebase profiling in requirements.md

What we chose: spec-writer auto-detects your project's language, framework, test framework, build system, and directory structure, then embeds this as a YAML frontmatter block in requirements.md.

Why: Downstream skills need to know how to generate tests (Jest vs pytest vs Go testing), where to put files, and what assertion patterns to use. Rather than having each skill re-scan the codebase, the profile is detected once and consumed by all downstream phases.

Trade-off: The profile can be wrong. Auto-detection from file patterns is heuristic. A monorepo with both Python and TypeScript might get the wrong primary language. The profile supports user overrides (# user-override comments), but the user has to notice the error first.
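
The monorepo failure mode is easy to reproduce with a marker-file heuristic. This sketch is an assumption about how such detection might look, not spec-writer's actual rules: the marker files are common conventions, and `detect_profile`, its tie-break, and the returned fields are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Marker file -> language. A common heuristic; the real detection rules may differ.
MARKER_FILES = {
    "tsconfig.json": "typescript",
    "pyproject.toml": "python",
    "go.mod": "go",
}


def detect_profile(root: Path, override: Optional[str] = None) -> dict:
    """Guess the primary language from marker files, unless the user overrides it."""
    if override is not None:
        return {"language": override, "source": "user-override", "candidates": [override]}
    candidates = sorted({lang for marker, lang in MARKER_FILES.items()
                         if (root / marker).is_file()})
    return {
        # Arbitrary tie-break: with several candidates the first one wins,
        # which is exactly how a monorepo can end up with the wrong language.
        "language": candidates[0] if candidates else None,
        "source": "auto-detected",
        "candidates": candidates,
    }


print(detect_profile(Path(".")))
print(detect_profile(Path("."), override="typescript"))
```

Reporting every candidate alongside the chosen language gives the user a better chance of noticing a wrong guess at the Initial Review checkpoint.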

How exe-build Compares

vs. Superpowers (Subagent-Driven Development)

Superpowers is a workflow plugin for Claude Code that provides brainstorming, plan writing, and subagent-driven execution with two-stage review (spec compliance + code quality).

Where Superpowers excels:

Lower friction for small tasks — brainstorming is conversational, plans are quick
Subagent-per-task model keeps context fresh and prevents pollution
Two-stage review (spec compliance, then code quality) is clean and simple
Plugin marketplace installation (/plugin install) is simpler than copying files (exe-build plugin coming soon)
Works well for plan-execute workflows where the plan is written by a human or the AI

Where exe-build differs:

Dimension	Superpowers	exe-build
Spec format	Natural-language design doc	EARS atomic requirements with metadata (from your design doc/PRD)
Design phase	Built-in brainstorming skill (Socratic Q&A → design doc)	Outsourced — use Perplexity, Context7, or any research tool to create a design doc, then feed it into the pipeline. If the doc has gaps, spec-writer notes assumptions and the Initial Review checkpoint lets you correct scope before autonomous execution. Standalone `/spec-writer` will also ask targeted questions to fill gaps.
Traceability	Task-level: plan task → implementation → spec review → quality review	Requirement-level: REQ → AC → TEST → TASK with numbered IDs and bidirectional links across all artifacts
Validation	Two-stage review per task (spec compliance check, then code quality check with Critical/Important/Minor severity)	10+ gates across the full pipeline (E1–E5 for AC coverage, C1–C5 for implementation, Gate 1–3 for specs) with auto-repair loops on failure
Test generation	Strict TDD per task — same agent writes failing test first, then implements code to pass it (RED-GREEN-REFACTOR)	Tests derived from acceptance criteria before any implementation begins — a separate skill generates all tests from AC, then implementation makes them pass. Tests are independent of the implementer's interpretation.
Fix strategy	Implementer fixes own code based on reviewer feedback, reviewer re-checks	8 domain specialists (auth, security, state, concurrency, etc.) dispatched by defect class — each with a 300–450 line playbook of domain-specific patterns
Failure analysis	Code reviewer categorizes issues by severity (Critical/Important/Minor) with file:line references and fix suggestions	code-evaluator classifies failures by defect type, maps each to root cause and originating requirement, prioritizes by impact, then routes to the right specialist
Execution model	Subagent-per-task — each task gets a fresh agent with clean context, no pollution from previous tasks	Inline by default — entire pipeline runs in one conversation, preserving full context of why each requirement exists and how each AC was derived across all phases
Scaling	Same process for all tasks	Tier 1/2/3 adapts pipeline depth to feature complexity — Tier 1 skips design/AC for simple features, Tier 3 adds architecture review and property-based tests
Token cost	Lower per-task (one implementer + two reviewers per task)	Higher total (full pipeline with gates and repair loops), but front-loads cost into spec validation to prevent expensive rework downstream

The execution model trade-off is fundamental. Superpowers spawns a fresh subagent for every task. Each agent starts clean — no context pollution from previous tasks, no risk of the model "forgetting" instructions mid-conversation. The cost is that each subagent loses the context of what other tasks did. The parent orchestrator summarizes, but summaries lose nuance. exe-build takes the opposite bet: run the entire pipeline inline so that when test-writer generates tests, it has the full context of why spec-writer wrote each requirement and how ac-writer derived each acceptance criterion. No summarization loss. The cost is that long pipelines accumulate context and can approach window limits — exe-build mitigates this by splitting to agents only when context exceeds ~800K tokens or for fix-specialist dispatch where isolated domain expertise matters more than pipeline context.

The test generation trade-off matters. Both systems enforce TDD — tests before code. The difference is where the tests come from. In Superpowers, the same agent that writes the test also writes the implementation. It's disciplined (strict RED-GREEN-REFACTOR), but the tests reflect that single agent's interpretation of the task. In exe-build, tests are generated by a separate skill (test-writer) from validated acceptance criteria that were derived from EARS requirements. The tests are structurally independent of the implementation — they validate the spec, not the implementer's understanding of the spec. This means exe-build catches cases where the implementer would have misunderstood the requirement, because the test was written by a different phase with different inputs.

The design phase trade-off. Superpowers handles both design and execution in one system — brainstorming explores your intent through Socratic questions, produces a design doc, then the planner and subagents build it. exe-build intentionally separates design from execution. We believe you should have an extensive, in-depth conversation about what you want to build — using Perplexity, Context7, Claude, or any research tool — before you start building. Produce a thorough design document or PRD, then feed it into exe-build's pipeline. This isn't a limitation — it's a design choice. The research and brainstorming phase benefits from tools optimized for exploration (web search, deep research, multi-turn conversation). The build phase benefits from tools optimized for structured execution (gates, traceability, auto-repair). Combining both into one system means neither is best-in-class at its job. That said, exe-build doesn't leave you stranded with a bad doc. When you run /exe-build-e2e with an incomplete or vague document, the pipeline will extract what it can, note assumptions for any gaps it finds in requirements.md, and the Initial Review checkpoint gives you a chance to correct scope before autonomous execution begins. You don't need to run individual skills — /exe-build-e2e handles the full flow from document to verified code, including gap handling.

Speed vs Quality: When each system wins

Neither system is universally better. The right choice depends on what you're building.

Superpowers is faster when:

Tasks are well-defined and independent (subagent-per-task parallelizes naturally)
You're iterating rapidly on a feature you understand well
The plan is solid and you need execution speed, not spec validation
You want quick brainstorm → plan → build cycles for small-to-medium features
You need git worktree isolation, branch management, and merge/PR workflows built in

exe-build produces higher coverage when:

Requirements are complex, ambiguous, or derived from a document (EARS forces precision)
Missing an edge case would be costly (gates mechanically check every requirement has tests)
You need to prove traceability for compliance or audit (numbered REQ → AC → TEST → TASK chain)
The feature has many error paths, state transitions, or integration points (ac-writer enumerates them systematically)
Defects are domain-specific (auth bugs need auth expertise, not generic "fix it")

The honest assessment: For a well-understood 30-minute task, Superpowers' subagent model will ship it faster with less token spend. For a complex feature with 8+ requirements where you need confidence that every error path is tested and every requirement is traceable to implementation, exe-build's pipeline catches gaps that per-task reviews miss — because per-task reviews can't see cross-task coverage holes.

Combining them (specific workflows)

These systems have complementary strengths. Superpowers has skills exe-build doesn't — and exe-build has pipeline depth Superpowers doesn't. Here's how to use both:

Workflow 1: Superpowers brainstorming → exe-build pipeline. Use Superpowers' brainstorming skill to explore your intent and produce a design doc through Socratic Q&A. Then feed that design doc into exe-build's exe-build-e2e pipeline, which transforms it into EARS requirements, validated acceptance criteria, executable tests, and verified code with full traceability.

Workflow 2: exe-build specs → Superpowers execution. Run exe-build's spec phase (spec-writer → ac-writer → test-writer) to generate validated requirements, acceptance criteria, and tests. Then use Superpowers' subagent-driven-development to execute the tasks.md: each task gets a fresh agent with two-stage review, and the tests are already written from AC, so the implementer just needs to make them pass.

Workflow 3: Use Superpowers skills that exe-build lacks. Superpowers includes skills for areas exe-build doesn't cover:

Superpowers skill	What it does	When to use with exe-build
`using-git-worktrees`	Creates isolated branch workspace	Before starting an exe-build pipeline run — isolate the work
`systematic-debugging`	4-phase root cause investigation	When fix-dispatcher exhausts 3 iterations and you need manual debugging
`verification-before-completion`	"Evidence before claims" — run proof commands before declaring done	After exe-build pipeline completes — final smoke test before merge
`finishing-a-development-branch`	Verify tests → present merge/PR options → clean up	After exe-build pipeline completes — merge the work back to main
`receiving-code-review`	Respond to review feedback with technical rigor, not blind agreement	When a human reviewer has comments on code exe-build generated
`brainstorming`	Socratic design exploration	When you don't have a design doc yet and don't want to use Perplexity

A realistic combined session:

1. /superpowers:using-git-worktrees     → isolate the work on a branch
2. /superpowers:brainstorming           → explore intent, produce design doc
   (or use Perplexity/Context7 for deeper research)
3. /exe-build-e2e @design-doc.md        → spec → AC → tests → implement → verify
4. /superpowers:verification-before-completion → final evidence-based check
5. /superpowers:finishing-a-development-branch → merge or PR

vs. GSD (Get Shit Done)

GSD is a full project management framework with roadmapping, phase planning, execution, and verification — designed for multi-milestone, multi-phase projects.

Where GSD excels:

Project-level orchestration — roadmaps, milestones, multi-phase planning
Rich agent ecosystem (15+ specialized agents: planner, executor, verifier, debugger, UI auditor, etc.)
State management across sessions (STATE.md, checkpoints, resume-work)
Goal-backward verification (checks if the goal was achieved, not just if tasks completed)
UI-specific workflows (UI-SPEC.md, 6-pillar visual audit)
Built for long-running projects with multiple developers

Where exe-build differs:

Dimension	GSD	exe-build
Scope	Full project lifecycle (roadmap → milestone → phase → task)	Single feature (idea → verified implementation)
Planning	Discussion → research → plan → execute → verify per phase	Spec → AC → tests → implement → verify per feature
Requirements	Natural language in CONTEXT.md with locked/deferred decisions	EARS-format atomic requirements with metadata in requirements.md
Verification	Goal-backward (did we achieve the goal?)	Gate-forward (did each artifact pass validation?) + code evaluation
Test approach	Opt-in TDD per task (`tdd="true"` in plan) + Nyquist auditor fills coverage gaps post-implementation	TDD-native by default — all tests generated from AC before any implementation begins
Fix approach	Debugger agent with scientific method	8 domain specialists dispatched by defect classification
State persistence	STATE.md, SUMMARY.md, checkpoints across sessions	Artifacts in docs/features/ (stateless pipeline, re-runnable)
Agent count	15+ specialized agents	9 skills + 8 sub-agents
Token cost	High (many agent spawns per phase)	High (deep per-feature pipeline)

The scope trade-off is the key difference. GSD manages the project — what gets built when, in what order, across how many phases. exe-build manages the feature — given a feature to build, produce a verified implementation. GSD asks "what should Phase 3 of Milestone 2 deliver?" exe-build asks "does this implementation satisfy all 7 requirements for the auth feature?" These are different questions at different levels of abstraction, which is why the systems compose well rather than compete.

The verification trade-off matters. GSD's goal-backward verification is powerful — it starts from the desired outcome and checks whether the codebase actually delivers it, regardless of whether tasks were marked complete. This catches the "placeholder problem" where a task is technically done (file created) but the goal isn't achieved (component is a stub). exe-build's gate-forward validation catches a different class of problems — coverage gaps where a requirement has no AC, an AC has no test, or a test has no task. Goal-backward asks "does it work?" Gate-forward asks "is everything covered?" You want both.

The testing trade-off. GSD supports TDD per task (opt-in via tdd="true" in the plan), and its Nyquist auditor retroactively fills test coverage gaps after implementation. exe-build generates all tests from acceptance criteria before implementation begins, so every test exists before a line of code is written. GSD's approach is more flexible (you choose when to apply TDD); exe-build's is more rigorous (tests are structurally independent of implementation). GSD's Nyquist auditor is a safety net that catches what was missed; exe-build's evaluation criteria (E1–E5) are a gate that blocks progress until coverage is complete.

The state persistence trade-off. GSD maintains rich state across sessions: STATE.md tracks position, SUMMARY.md records what happened, and checkpoints allow pausing and resuming across context resets. This makes GSD excellent for multi-day, multi-session projects. exe-build is stateless by design: all state lives in the artifact files (docs/features/*/). You can re-run any phase at any time by pointing it at the existing artifacts. The trade-off: GSD handles interruptions gracefully; exe-build handles re-runs gracefully.

The fix approach trade-off. GSD's debugger agent uses the scientific method — hypothesis, experiment, conclusion. It's a generalist that can investigate any bug through systematic reasoning. exe-build's fix-dispatcher routes defects to 8 domain specialists, each with a detailed playbook. The debugger is better for novel, unexpected bugs where you don't know the category. The specialists are better for known defect classes (auth, concurrency, data integrity) where domain-specific patterns and anti-patterns matter more than general investigation.

Speed vs Quality: When each system wins

GSD is stronger when:

You're managing a multi-milestone project that spans weeks or months
Work happens across multiple sessions and needs persistent state
Phases have diverse types of work (UI, backend, infrastructure) that benefit from specialized agents (UI auditor, codebase mapper, integration checker)
You need a research phase before planning — GSD has dedicated researcher and synthesizer agents
The project involves UI work — GSD has UI-SPEC.md, 6-pillar visual audit, and UI-specific quality checks

exe-build is stronger when:

You need deep, feature-level spec-to-test coverage with full traceability
Requirements must be precise and testable (EARS eliminates ambiguity)
You need mechanical proof that every requirement has acceptance criteria and every AC has tests (E1–E5 gates)
Defects need domain-expert fixes rather than general debugging
The feature is self-contained and can be specified, tested, and verified in one pipeline run

The honest assessment: GSD is the better project management system — it handles the complexity of coordinating multi-phase work across sessions. exe-build is the better feature verification system — it ensures that any individual feature is thoroughly specified, tested, and traceable. They operate at different levels and don't conflict.

Combining them (specific workflows)

GSD and exe-build compose naturally because they operate at different levels — project vs feature.

Workflow 1: GSD roadmap → exe-build per feature. Use GSD to create the project roadmap, break it into milestones and phases, and manage cross-phase dependencies. When a phase requires building a feature with rigorous spec-to-test coverage, GSD's executor can delegate to exe-build's pipeline. The exe-build artifacts (requirements.md, ac.json, tests/) become the phase deliverables that GSD's verifier checks.

Workflow 2: exe-build specs → GSD execution. Run exe-build's spec phase (spec-writer → ac-writer → test-writer) to generate validated requirements, AC, and tests. Then use GSD's executor to implement the tasks.md with its deviation handling, atomic commits, and checkpoint protocols. GSD's state persistence means you can pause mid-implementation and resume in a new session.

Workflow 3: Use GSD agents that exe-build lacks. GSD has agents for areas exe-build doesn't cover:

GSD agent	What it does	When to use with exe-build
`gsd-debugger`	Scientific-method bug investigation with persistent debug sessions	When fix-dispatcher exhausts 3 iterations — switch to systematic root cause investigation
`gsd-verifier`	Goal-backward verification (does the code achieve the goal, not just complete tasks?)	After exe-build pipeline completes — verify the feature actually works end-to-end, not just that gates passed
`gsd-ui-auditor`	6-pillar visual audit of frontend code	After exe-build builds a frontend feature — audit the visual quality
`gsd-ui-researcher`	Produces UI-SPEC.md design contracts	Before running exe-build on a frontend feature — define the visual spec
`gsd-codebase-mapper`	Parallel codebase analysis across tech, architecture, quality, concerns	Before starting exe-build on a large unfamiliar codebase — understand what you're working with
`gsd-integration-checker`	Verifies cross-phase integration and E2E user flows	After multiple exe-build pipeline runs — check that features connect properly

A realistic combined session:

1. /gsd:new-project                     → roadmap with milestones and phases
2. /gsd:plan-phase                      → plan Phase 1 with research and task breakdown
3. /exe-build-e2e @phase-1-design.md    → spec → AC → tests → implement → verify
4. /gsd:verify-work                     → goal-backward check: did Phase 1 deliver?
5. /gsd:plan-phase                      → plan Phase 2
6. /exe-build-e2e @phase-2-design.md    → repeat for Phase 2
7. /gsd:audit-milestone                 → audit milestone completion before moving on

vs. Standard TDD / SDD Workflows

Standard TDD (Test-Driven Development) and SDD (Spec-Driven Development, e.g., open-spec) workflows follow a simpler pattern: write tests or specs by hand, implement code, verify. This is how most professional software gets built today — no AI pipeline, no token cost, just developers and their tools.

Where standard TDD/SDD excels:

Minimal overhead — no pipeline, no gates, no artifact generation
Human judgment drives the spec (not AI generation that might hallucinate requirements)
No token cost beyond the implementation itself
Well-understood by any developer — no tooling setup, no learning curve
The developer who writes the spec has full domain context that no AI possesses
Battle-tested across decades of real-world software projects

Where exe-build differs:

Dimension	Standard TDD/SDD	exe-build
Who writes specs	Human developer with domain expertise	AI generates from idea/doc, human reviews once
Who writes tests	Human developer	AI generates from validated AC
Spec format	Varies (user stories, Gherkin, free text, internal docs)	EARS atomic requirements + machine-readable AC JSON
Traceability	Manual (developer keeps track, or uses JIRA/Linear links)	Automatic (REQ → AC → TEST → TASK with numbered IDs, bidirectional)
Validation	Human review + CI pipeline	10+ automated gates with repair loops, then CI
Coverage gaps	Caught by code review, QA, or production incidents	Caught by evaluation criteria E1–E5 (mechanically, before implementation)
Fix routing	Developer debugs and fixes based on experience	Automated defect classification + specialist dispatch
Iteration speed	Depends on team size and developer speed	AI generates in minutes, but pipeline adds overhead
Token cost	Zero (human-driven)	50K–500K+ tokens per feature depending on complexity
Wall-clock time	Hours to days (human writes everything)	Minutes to hours (AI generates, human reviews)

The authorship trade-off. When a human writes specs and tests, every decision reflects their domain knowledge, their understanding of edge cases from years of experience, and their intuition about what matters. No AI can match a senior developer's domain expertise. But human-authored specs have blind spots — the developer doesn't write tests for the edge cases they don't think of. exe-build's mechanical coverage checks (E1–E5) catch a different class of gaps: the systematic ones. "Every requirement has at least one AC, every error case has a test, every state transition is covered" — these are checks that don't require domain expertise, just thoroughness. A human is better at knowing which edge cases matter most. exe-build is better at ensuring none are accidentally skipped.

The traceability trade-off. In manual TDD, traceability lives in the developer's head, in JIRA ticket links, or in commit messages. It's implicit and often incomplete. exe-build's numbered ID system (REQ-001 → AC-001 → TEST-001 → TASK-001) is explicit and machine-readable, but it's also rigid — it adds structural overhead that a small team doing TDD in a well-understood codebase doesn't need. Traceability matters most when: (a) requirements come from external stakeholders who need proof of coverage, (b) the codebase has compliance requirements, or (c) the team is large enough that implicit knowledge gets lost. For a solo developer on a personal project, it's overkill.
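To make "explicit and machine-readable" concrete, here is a minimal sketch of the kind of check a numbered ID chain allows. The input shapes (`requirements`, `ac_to_req`, `test_to_ac`) are hypothetical and chosen only for illustration; they are not the pipeline's actual ac.json schema, and this is not the E1–E5 implementation.

```python
from typing import Dict, List, Set


def find_trace_gaps(
    requirements: List[str],
    ac_to_req: Dict[str, str],
    test_to_ac: Dict[str, str],
) -> Dict[str, List[str]]:
    """Return requirement and AC IDs that have nothing downstream of them.

    All three inputs are hypothetical shapes used for illustration;
    the real artifact schema may differ.
    """
    covered_reqs: Set[str] = set(ac_to_req.values())
    covered_acs: Set[str] = set(test_to_ac.values())
    return {
        "requirements_without_ac": sorted(
            r for r in requirements if r not in covered_reqs
        ),
        "acs_without_test": sorted(a for a in ac_to_req if a not in covered_acs),
    }


gaps = find_trace_gaps(
    requirements=["REQ-001", "REQ-002"],
    ac_to_req={"AC-001": "REQ-001"},  # REQ-002 has no acceptance criterion
    test_to_ac={},                    # AC-001 has no test yet
)
print(gaps)
```

A check like this needs no domain knowledge, which is the point of the paragraph above: it only finds missing links, not wrong ones.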

The cost trade-off. Standard TDD costs zero tokens but costs developer hours. exe-build costs 50K–500K tokens but generates specs, tests, and traceability in minutes. The right comparison isn't "free vs expensive" — it's "developer time vs token spend." If developer time is cheap and domain knowledge is critical, manual TDD wins. If developer time is expensive and you need validated specs fast, exe-build wins.

The trust trade-off. Human-authored specs are trusted by default — the developer understands what they wrote. AI-generated specs require review — the developer must verify that the AI understood the intent correctly. This review cost is real but often overlooked. exe-build mitigates it with the Initial Review checkpoint and EARS formatting (structured requirements are easier to verify than free-form prose), but it doesn't eliminate it.

Speed vs Quality: When each system wins

Standard TDD/SDD is better when:

The developer has deep domain expertise and knows the edge cases from experience
The feature is well-understood and doesn't need formal specification
Token cost is a concern and developer time is available
The team has established spec/test patterns and doesn't need tooling to enforce them
Requirements come from the developer's own understanding (not an external doc)
The project is small enough that implicit traceability (commit messages, ticket links) is sufficient

exe-build is better when:

Requirements come from a document (PRD, design doc, research) that needs to be transformed into testable specs
The feature has many error paths, state transitions, or integration points that a human might not enumerate completely
You need explicit, auditable traceability (compliance, external stakeholders, handoffs between teams)
Developer time is more expensive than token cost
You want coverage guarantees before implementation, not coverage discovery after production incidents
The spec author and the implementer are different people (or different AI agents) — exe-build's separation of concerns prevents the implementer from writing tests that validate their own misunderstanding

The honest assessment: A senior developer doing disciplined TDD will produce better specs and more insightful tests than exe-build for features in their area of expertise. exe-build will produce more complete coverage (no gaps in requirement-to-test mapping) and do it faster in wall-clock time. The ideal workflow may be: use exe-build to generate the initial spec and tests, then have a senior developer review and refine them — combining mechanical completeness with human judgment.

Combining them

exe-build and manual TDD are not mutually exclusive. The pipeline generates artifacts that fit naturally into a standard development workflow.

Workflow 1: exe-build specs → human TDD implementation. Run exe-build's spec phase (spec-writer → ac-writer → test-writer) to generate validated requirements, acceptance criteria, and test skeletons. Then hand the artifacts to a developer who implements using traditional TDD, with a complete spec, pre-written test cases, and a traceability matrix in hand. The developer can modify, add, or remove tests based on their domain expertise.

Workflow 2: Human specs → exe-build test generation. Write requirements and design docs by hand (the way you always have). Then run just /test-writer to generate tests from your specs. exe-build's evaluation criteria (E1–E5) will mechanically check whether your hand-written specs have coverage gaps, such as missing error paths, untested state transitions, or requirements without acceptance criteria. This uses exe-build as a coverage audit tool rather than a full pipeline.

Workflow 3: exe-build as CI gate. Integrate exe-build's evaluation criteria into your existing CI pipeline. After a developer writes specs and tests manually, run the evaluation phase to verify coverage completeness. This adds mechanical coverage checking to your existing human-driven workflow without replacing it.

A realistic combined session:

1. Developer writes design doc manually (or with Perplexity/research)
2. /spec-writer @design-doc.md          → EARS requirements from the doc
3. Developer reviews requirements.md, corrects domain-specific issues
4. /ac-writer                           → scenarios + acceptance criteria
5. /test-writer                         → generated tests + coverage evaluation
6. Developer reviews tests, adds domain-specific edge cases
7. Developer implements using traditional TDD (tests are already written)
8. /test-runner                         → verify all tests pass

Token Usage and Model Considerations

Token cost by pipeline phase

These are approximate costs for a Tier 2 feature (3–8 requirements) using Claude Opus 4.6:

Phase	Input tokens	Output tokens	Notes
spec-writer	15K–30K	8K–15K	Codebase scan + 3 artifacts + 3 gates
ac-writer	10K–20K	5K–10K	Reads specs, generates scenarios + AC + 2 gates
test-writer	15K–30K	10K–20K	Evaluation + test generation + backfill + 3 gates
Implementation	20K–50K	15K–40K	Reads tasks + tests, writes code
test-runner	5K–10K	2K–5K	Executes tests, collects results
code-evaluator	15K–25K	8K–15K	Failure mapping + classification + recommendations
fix-dispatcher	10K–20K per specialist	5K–15K per specialist	Only if fixes needed
Repair loops	10K–20K per attempt	5K–10K per attempt	Only if gates fail

Typical total for a Tier 2 feature: 100K–250K tokens (no repair loops) to 300K–500K tokens (with repair loops and fix iterations).
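As a sanity check on the no-repair figure, the per-phase ranges from the table can be summed directly. The sketch below copies those numbers (in thousands of tokens) and leaves out fix-dispatcher and repair loops, since they only run when something fails. The sum lands close to, though slightly above, the quoted 100K–250K range; both figures are approximations.

```python
# (input_low, input_high, output_low, output_high) in thousands of tokens,
# copied from the per-phase table; conditional phases are excluded.
PHASES = {
    "spec-writer": (15, 30, 8, 15),
    "ac-writer": (10, 20, 5, 10),
    "test-writer": (15, 30, 10, 20),
    "implementation": (20, 50, 15, 40),
    "test-runner": (5, 10, 2, 5),
    "code-evaluator": (15, 25, 8, 15),
}


def total_range(phases):
    """Sum the low and high ends of input + output tokens across phases."""
    low = sum(i_lo + o_lo for i_lo, _, o_lo, _ in phases.values())
    high = sum(i_hi + o_hi for _, i_hi, _, o_hi in phases.values())
    return low, high


low, high = total_range(PHASES)
print(f"{low}K-{high}K tokens")  # prints "128K-270K tokens"
```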

Tier 1 (abbreviated): 30K–80K tokens. Skips design.md, ac-writer, and code-evaluator.

Tier 3 (extended): 300K–800K+ tokens. Adds architecture review, property-based tests, and code review.

Why the token cost is worth it

The alternative is not "zero tokens" — it's "the same tokens spent on rework." Without the pipeline:

AI writes code from a vague description (~50K tokens)
Tests fail or are wrong (~20K tokens to debug)
Human discovers a missed requirement (~50K tokens to rework)
More tests fail (~20K tokens)
Repeat until frustrated

Total: roughly 140K tokens for a single pass through this cycle and 150K–300K once it repeats, with lower quality and no traceability. The pipeline front-loads the cost into structured spec work that prevents rework.

Model selection

Recommended: Claude Opus 4.6 (1M context)

The pipeline was designed for and tested with Claude Opus 4.6. Here's why:

Context window: A full Tier 2 pipeline generates 80K–150K tokens of artifacts. Inline execution (our default) needs the full pipeline context in one conversation. Opus 4.6's 1M context handles this easily. Smaller context models (32K–128K) would force agent splitting, losing context and quality.
Instruction following: The skills contain precise, multi-step instructions with conditional logic ("if E1–E4 fails, enter repair loop"). Opus 4.6 follows these reliably. Smaller models skip steps or invent shortcuts.
EARS generation: Producing correct EARS-format requirements with proper metadata fields requires understanding the pattern system. Opus 4.6 generates valid EARS consistently. Sonnet 4.6 works for simpler features but occasionally drops metadata fields.

Will it work with other models?

Model	Viability	Notes
Claude Opus 4.6	Full support	Designed for this model
Claude Sonnet 4.6	Good for Tier 1–2	May need agent splitting for Tier 3; occasional gate issues
Claude Haiku 4.5	Tier 1 only	Struggles with multi-gate validation and complex EARS patterns
GPT-4o / o1	Untested	Skills are model-agnostic Markdown; should work with prompt adaptation
Other agents	Untested	Any agent that reads `.md` skill files can use these

Reducing token cost

If token cost is a concern:

Use Tier 1 for small features. Override the auto-classifier: "Make it Tier 1." This skips design.md, ac-writer, and code-evaluator.
Run individual skills. Use /spec-writer alone to generate specs, then implement manually.
Skip the fix loop. If your implementation is close, run test-runner and fix manually instead of letting fix-dispatcher iterate.
Use Sonnet 4.6 for Tier 1. The abbreviated pipeline is simple enough for Sonnet.

Security and Auditability

Everything is auditable

These skills are plain Markdown files. There is no:

Compiled code or binaries
Network calls or telemetry
External service dependencies
Encrypted or obfuscated content
Dynamic code generation beyond what your AI agent does natively
Package dependencies or install scripts

Every instruction the AI receives is readable text. If a skill tells the AI to do something, you can see exactly what it says. The entire system is ~14,500 lines of Markdown across 9 skill files and 8 specialist sub-agent playbooks.

Audit the entire skill set in one command

With Claude Code (recommended — audits everything at once):

claude -p "Read every SKILL.md file in .claude/skills/ and every .md file in \
  .claude/subagents/. For each file, report: (1) what it instructs the AI to do, \
  (2) what files it reads, (3) what files it writes or modifies, (4) any shell \
  commands it executes, (5) any network or external service access, (6) anything \
  that could be a security concern. Summarize with a risk assessment."

With any other AI agent (Cursor, Windsurf, Codex, etc.):

Point the agent at the skill directories and ask the same question. The files are standard Markdown — any tool that can read text can audit them.

With grep (quick automated scan for risky patterns):

# Check for network/fetch/curl/API instructions
grep -ri "curl\|wget\|fetch(\|http://\|https://\|api\..*\.com" .claude/skills/ .claude/subagents/

# Check for file deletion instructions
grep -ri "rm \|rm -\|delete\|rmdir\|unlink\|fs\.remove" .claude/skills/ .claude/subagents/

# Check for shell execution beyond test running
grep -ri "child_process\|subprocess\|os\.system\|exec(\|spawn(" .claude/skills/ .claude/subagents/

# Check for credential or secret handling
grep -ri "api.key\|api_key\|secret\|credential\|\.env" .claude/skills/ .claude/subagents/

# Check for base64 or encoding (could hide content)
grep -ri "base64\|btoa\|atob\|encode(" .claude/skills/ .claude/subagents/

If any of these return results, read the context before drawing conclusions: some keywords appear in the AC generation examples (such as credential-related terms in an auth feature example) but do not instruct the AI to handle credentials.

Diffing against upstream (tamper detection):

# After cloning, fetch upstream and verify no local modifications (empty output = no changes)
git fetch origin
git diff origin/main -- .claude/skills/ .claude/subagents/

What the skills actually do (permission matrix)

Skill	Reads	Writes	Executes
spec-writer	Your codebase structure (to detect stack)	`docs/features/*/requirements.md`, `design.md`, `tasks.md`	Nothing
ac-writer	Requirements and design files	`scenarios.md`, `ac.json`	Nothing
test-writer	AC files, your codebase	`tests/*.test.*`, `evaluation.json`, `testids.json`	Nothing
test-runner	Test files	`test-results.json`	Your test command (`npm test`, `pytest`, etc.)
code-evaluator	Test results, spec files	`code-evaluation.json`, `code-report.md`	Nothing
fix-dispatcher	Evaluation results, spec files	Your source code (targeted fixes)	Nothing
code-reviewer	Your code, spec files	`code-review.md`	Nothing
ac-fix	Evaluation feedback, AC files	`ac-fixed.json`	Nothing
exe-build-e2e	All of the above (orchestrator)	All of the above	Delegates to test-runner

The only skill that executes shell commands is test-runner, and it runs your project's existing test command (the same command you'd run manually).

fix-dispatcher modifies your source code, but only the specific files identified by code-evaluator's failure analysis. It does not modify files outside the scope of the failing tests.

Permissions model

These skills operate within your AI agent's existing permission model:

Claude Code: If you use --allowedTools or permission prompts, skills cannot bypass them. The skills are instructions — your agent enforces the actual permissions.
Other agents: Skills operate within the agent's built-in approval flow. File writes require your approval unless auto-approved.
Any agent: The skills never circumvent the agent's built-in permission system. They are text instructions, not code that runs independently.

How It Works

Skill format

Each skill is a Markdown file with YAML frontmatter:

---
name: spec-writer
version: v3
description: When to use this skill
---

# Skill Name

## Overview
What the skill does.

## When to Use
Trigger conditions.

## Pipeline
Step-by-step instructions with validation gates.

## Gates
Validation criteria that must pass before proceeding.

The AI agent reads these files as instructions. The frontmatter enables discovery and routing. The Markdown body is the actual behavior specification.
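For readers who want to inspect skill files programmatically, the following is a minimal sketch of splitting frontmatter from body using only the standard library. It assumes flat `key: value` frontmatter as in the example above; the function name `split_frontmatter` is hypothetical, and this is not how any particular agent performs discovery.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a skill file into (frontmatter fields, Markdown body).

    Handles only flat `key: value` lines; nested YAML would need a
    real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        return {}, text  # no closing delimiter: treat the file as body only
    fields: Dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return fields, body


example = """---
name: spec-writer
version: v3
description: When to use this skill
---

# Skill Name
"""
fields, body = split_frontmatter(example)
print(fields["name"], fields["version"])
print(body)
```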

Pipeline architecture

The exe-build-e2e skill orchestrates the others in sequence:

spec-writer → ac-writer → test-writer → [implementation] → test-runner → code-evaluator → fix-dispatcher
                                                                │              │
                                                                ▼              ▼
                                                          code-reviewer    8 specialist
                                                          [Tier 3]         sub-agents

Each skill:

Declares its inputs (which files must exist)
Validates inputs before proceeding
Generates outputs with deterministic naming
Runs validation gates on its outputs
Auto-repairs gate failures (up to 3 attempts)
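
The validate-then-repair behavior in the list above can be pictured as a bounded loop. This is a sketch under stated assumptions, not the skills' actual logic (which is expressed as Markdown instructions): `check` and `repair` are hypothetical callables, and `GateFailure` is an illustrative name for the hard-fail case.

```python
from typing import Callable, List


class GateFailure(Exception):
    """Raised when an artifact still fails its gate after all repair attempts."""


def run_gate(
    artifact: str,
    check: Callable[[str], List[str]],
    repair: Callable[[str, List[str]], str],
    max_attempts: int = 3,
) -> str:
    """Validate an artifact, repairing it up to `max_attempts` times.

    `check` returns the failing criteria (empty list means pass);
    `repair` returns a revised artifact. Both are hypothetical.
    """
    failures = check(artifact)
    attempts = 0
    while failures and attempts < max_attempts:
        artifact = repair(artifact, failures)
        attempts += 1
        failures = check(artifact)
    if failures:
        raise GateFailure(f"gate failed after {attempts} repair attempts: {failures}")
    return artifact


# Toy gate: every requirement line must contain the word "shall".
def check(text: str) -> List[str]:
    return [line for line in text.splitlines() if "shall" not in line]


def repair(text: str, failures: List[str]) -> str:
    return "\n".join(
        line.replace("should", "shall") if line in failures else line
        for line in text.splitlines()
    )


fixed = run_gate("The system should log errors", check, repair)
print(fixed)  # prints "The system shall log errors"
```

Raising instead of returning a partially valid artifact mirrors the stated behavior that the pipeline does not push through broken artifacts.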

EARS methodology

Requirements use the EARS (Easy Approach to Requirements Syntax) patterns:

Pattern	Template	Example
Ubiquitous	The system shall [action]	The system shall encrypt passwords at rest
Event-driven	When [event], the system shall [action]	When session expires, the system shall redirect to login
State-driven	While [state], the system shall [action]	While offline, the system shall queue mutations
Optional	Where [condition], the system shall [action]	Where MFA is enabled, the system shall require a second factor
Complex	While [state], when [event], the system shall [action]	While rate-limited, when new request arrives, the system shall return 429

This eliminates ambiguous requirements like "the system should handle authentication" and forces precise, testable statements.
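
Because each EARS pattern has a fixed template, conformance can be checked mechanically. The sketch below matches the five templates from the table with regular expressions; it is an illustration only, the function name `classify_ears` is hypothetical, and the pipeline's own gates may validate requirements differently.

```python
import re
from typing import Optional

# Order matters: Complex must be tried before State-driven,
# because both templates start with "While".
EARS_PATTERNS = [
    ("Complex", re.compile(r"^While .+, when .+, the system shall .+", re.IGNORECASE)),
    ("Event-driven", re.compile(r"^When .+, the system shall .+", re.IGNORECASE)),
    ("State-driven", re.compile(r"^While .+, the system shall .+", re.IGNORECASE)),
    ("Optional", re.compile(r"^Where .+, the system shall .+", re.IGNORECASE)),
    ("Ubiquitous", re.compile(r"^The system shall .+", re.IGNORECASE)),
]


def classify_ears(requirement: str) -> Optional[str]:
    """Return the EARS pattern name, or None if no template matches."""
    text = requirement.strip()
    for name, pattern in EARS_PATTERNS:
        if pattern.match(text):
            return name
    return None


print(classify_ears("While rate-limited, when new request arrives, the system shall return 429"))
print(classify_ears("When session expires, the system shall redirect to login"))
print(classify_ears("The system should handle authentication"))
```

The last call returns None: "should handle" does not fit any template, which is exactly the kind of statement the format is meant to reject.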

Complexity tiers

Tier	Requirements	Pipeline	What's included	What's skipped
Tier 1	< 3	Abbreviated	requirements.md, tests, test-runner	design.md, ac-writer, code-evaluator
Tier 2	3–8	Standard	All phases, all gates, fix loop	Property-based tests, code review
Tier 3	> 8 or multi-service	Extended	Everything + architecture review, property-based tests, code review	Nothing

The pipeline auto-classifies the tier from the requirement count (and, per the table, whether the feature spans multiple services), but you can override it: "Make it Tier 1."
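
Read as a decision function, the tier thresholds look roughly like the sketch below. The function name and the `multi_service` and `override` parameters are hypothetical; the actual classification is done by the agent following the skill's Markdown instructions.

```python
from typing import Optional


def classify_tier(
    requirement_count: int,
    multi_service: bool = False,
    override: Optional[int] = None,
) -> int:
    """Map a feature to Tier 1, 2, or 3 using the thresholds in the table."""
    if override is not None:
        if override not in (1, 2, 3):
            raise ValueError("override must be 1, 2, or 3")
        return override  # a user override such as "Make it Tier 1" wins
    if requirement_count < 0:
        raise ValueError("requirement_count must be non-negative")
    if multi_service or requirement_count > 8:
        return 3
    if requirement_count >= 3:
        return 2
    return 1


print(classify_tier(2))                      # 1
print(classify_tier(5))                      # 2
print(classify_tier(4, multi_service=True))  # 3
print(classify_tier(12, override=1))         # 1
```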

Specialist sub-agents

When code-evaluator identifies defects, fix-dispatcher routes each to the right expert:

Sub-agent	Domain	Playbook size
`validation-fix`	Input validation, schema enforcement	~264 lines
`auth-fix`	Authentication, authorization, sessions	~346 lines
`state-fix`	State machines, race conditions, lifecycle	~314 lines
`error-handling-fix`	Error propagation, recovery, boundaries	~385 lines
`concurrency-fix`	Threading, async, deadlocks	~370 lines
`data-fix`	Data transforms, serialization, integrity	~386 lines
`integration-fix`	API contracts, protocols, versioning	~416 lines
`security-fix`	Injection, XSS, CSRF, cryptography	~345 lines

Each playbook contains domain-specific patterns, anti-patterns, verification strategies, and example fixes.
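
The routing step can be pictured as a lookup from defect class to sub-agent. In this sketch the sub-agent names come from the table above, but keying them by a `category` string on each defect, the defect IDs, and the `unrouted` fallback bucket are all assumptions made for illustration; the source does not describe how fix-dispatcher represents or handles unclassified defects.

```python
from collections import defaultdict
from typing import Dict, List

# Sub-agent names are from the table; the category keys are hypothetical.
SPECIALISTS = {
    "validation": "validation-fix",
    "auth": "auth-fix",
    "state": "state-fix",
    "error-handling": "error-handling-fix",
    "concurrency": "concurrency-fix",
    "data": "data-fix",
    "integration": "integration-fix",
    "security": "security-fix",
}


def route_defects(defects: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group defect IDs by the specialist that would handle them."""
    routed: Dict[str, List[str]] = defaultdict(list)
    for defect in defects:
        # "unrouted" is an illustrative fallback, not documented behavior.
        agent = SPECIALISTS.get(defect["category"], "unrouted")
        routed[agent].append(defect["id"])
    return dict(routed)


routes = route_defects([
    {"id": "D-1", "category": "auth"},
    {"id": "D-2", "category": "concurrency"},
    {"id": "D-3", "category": "auth"},
    {"id": "D-4", "category": "layout"},
])
print(routes)
```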

Project Structure

exe-build-skills/
├── .claude/
│   ├── skills/              # Skills for Claude Code (9 skills)
│   │   ├── spec-writer/     #   EARS requirements + design + tasks
│   │   ├── ac-writer/       #   Scenarios + acceptance criteria
│   │   ├── test-writer/     #   Test generation + evaluation
│   │   ├── test-runner/     #   Test execution + results
│   │   ├── code-evaluator/  #   Failure analysis + defect classification
│   │   ├── code-reviewer/   #   Spec compliance review
│   │   ├── fix-dispatcher/  #   Specialist routing + fix orchestration
│   │   ├── ac-fix/          #   AC repair from evaluation feedback
│   │   └── exe-build-e2e/   #   Pipeline orchestrator
│   ├── subagents/           # 8 specialist fix sub-agent playbooks
│   └── orchestrator/        # Node.js pipeline runner (optional)
├── .opencode/
│   └── skills/              # Same skills for OpenCode
├── ac-methods/              # 12 AC generation methodologies
├── docs/
│   └── features/            # Example generated specifications
└── README.md

Roadmap

v2.1.0 (coming soon)

Skill	Description	Status
code-review (standalone)	Git-diff-based code review that works on any code without requiring spec files. Review any commit range, PR, or working tree changes. No pipeline dependencies.	Planned
security-review	Dedicated security audit — OWASP top-10, dependency vulnerability scan, secrets detection, auth/authz review, input validation coverage. Standalone or pipeline-integrated.	Planned
code-refactor	Structured refactoring with safety nets — identify refactor targets, generate characterization tests before changing code, validate behavior preservation after refactor, track what moved where.	Planned
Plugin marketplace	Install exe-build via `/plugin install` instead of copying files.	Planned

These skills will work both standalone (run on any codebase, no spec files needed) and as pipeline extensions (feed findings into code-evaluator and fix-dispatcher).

Contributing

Fork the repo
Edit or add skills in .claude/skills/
Follow the existing skill format (YAML frontmatter + structured Markdown)
Test your changes by running the pipeline against a real feature
Submit a PR

Skills should be:

Self-contained: No external dependencies
Deterministic: Same input produces predictable output structure
Gated: Include validation criteria
Auditable: Plain Markdown, no obfuscation

License

MIT

FAQ

Q: Do I need Claude Code specifically? A: No. The repo includes skills for both Claude Code (.claude/skills/) and OpenCode (.opencode/skills/). Because the skills are plain Markdown, they should also work with Cursor, Windsurf, or any agent that reads .md instruction files from your repo, though the model table lists other agents as untested.

Q: Do these skills send data anywhere? A: No. The skills are local instruction files. They don't make network calls, phone home, or transmit anything. Your AI agent processes them locally.

Q: Can I use individual skills without the full pipeline? A: Yes. Each skill works standalone. Use /spec-writer to just generate specs, /test-writer to just generate tests, etc.

Q: What languages are supported? A: The skills auto-detect your project's language and framework. They've been tested with TypeScript, JavaScript, Python, Go, and Rust. The test generation adapts to your project's test framework.

Q: How do I verify the skills haven't been tampered with? A: grep them. cat them. diff them against upstream. Ask any AI to audit them. They're plain text — every instruction is visible. See the Security and Auditability section for specific commands.

Q: What if a gate fails after 3 repair attempts? A: The pipeline hard-fails and reports the specific failing criteria with context. It doesn't silently skip validation or push through broken artifacts. You can fix the issue manually and resume from that phase.

Q: How does this work with Superpowers or GSD? A: They complement each other. Use GSD for project-level planning (what features to build, in what order), exe-build for feature-level spec-to-code (turning each feature into verified implementation), and Superpowers for task-level execution (subagent-per-task with reviews). See the comparison section for details.

Q: Is the token cost justified? A: For features with 3+ requirements, yes. The pipeline front-loads cost into structured spec work that prevents rework. Without it, you spend similar tokens on debugging and re-implementation, but without traceability or coverage guarantees. For trivial features (< 3 requirements), use Tier 1 or individual skills to reduce cost. See Token Usage for detailed breakdowns.

Q: Can I use a cheaper model? A: Sonnet 4.6 works for Tier 1–2 features. Haiku works for Tier 1 only. The pipeline was designed for Opus 4.6 and its instruction-following reliability. See the model table for specifics.

If exe-build-skills saved you from a bad implementation or caught a requirement gap before production, consider starring the repo — it helps other developers find it.
