Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 26 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,32 @@ marketplace; see that repo's CHANGELOG for the catalog-side history.

## [Unreleased]

### Added

- `references/codex-prompt-blocks.md` — composable XML prompt blocks
(`<verification_loop>`, `<grounding_rules>`, `<action_safety>`, …), four
task recipes, and a prompt anti-pattern table for judgment-sensitive briefs.
Adapted from the `gpt-5-4-prompting` skill in
[`openai/codex-plugin-cc`](https://github.com/openai/codex-plugin-cc)
(Apache-2.0); reframed for this skill's single-shot brief-file workflow.

### Changed

- `SKILL.md` and `references/task-template.md` now point judgment-sensitive
tasks (debugging, write-capable changes, review, research) at the new
prompt-blocks reference; mechanical sweeps keep the flat template.
- `references/patterns.md` (Pattern 5) and `references/review-checklist.md`:
added an explicit "do not auto-apply fixes from a review run — present
findings and ask the user first" rule.
- `references/model-selection.md`: documented the `spark` shorthand
(`--model gpt-5.3-codex-spark`).

### Fixed

- `references/patterns.md` (Pattern 5): the review-mode code block used the
deprecated `--full-auto` flag; replaced with `--sandbox workspace-write` to
match `SKILL.md`, `wrapper.md`, and `examples.md`.

## [0.1.0] - 2026-05-15

The initial published version. Captures the skill state at commit
Expand Down
3 changes: 2 additions & 1 deletion skills/codex-delegate/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,7 +76,7 @@ Full routing table and good/bad examples: `references/delegation-targets.md`.

## Workflow

1. **Brief**: write `.ai/codex_task_<name>.md` with Context / Goal / Constraints / Acceptance. Template: `references/task-template.md`. If the brief was already written by `agent-task-splitter` at `.ai/codex_task_<NNN>_<slug>.md`, read `.coord/plan.yml` for round context first.
1. **Brief**: write `.ai/codex_task_<name>.md` with Context / Goal / Constraints / Acceptance. Template: `references/task-template.md`. For a judgment-sensitive task (debugging, write-capable change, review, research) — where an unsupported guess or a half-finished fix would hurt — also add the XML prompt blocks from `references/codex-prompt-blocks.md` to the Goal/Constraints. A pure mechanical sweep does not need them. If the brief was already written by `agent-task-splitter` at `.ai/codex_task_<NNN>_<slug>.md`, read `.coord/plan.yml` for round context first.

2. **Run**: from Claude Code Bash, invoke the wrapper from its install location:
```bash
Expand Down Expand Up @@ -112,6 +112,7 @@ Full routing table and good/bad examples: `references/delegation-targets.md`.
- `references/delegation-targets.md` — when to use vs avoid
- `references/wrapper.md` — full wrapper invocation, env vars, Windows runner notes
- `references/task-template.md` — task brief template
- `references/codex-prompt-blocks.md` — XML prompt blocks + recipes + anti-patterns for judgment-sensitive briefs
- `references/output-contract.md` — full `.result.json` schema, status semantics, `.fallback_claude` quota sentinel
- `references/review-checklist.md` — extended acceptance gate
- `references/patterns.md` — five single-task delegation shapes (context file, parallel, resume, structured output, review mode)
Expand Down
248 changes: 248 additions & 0 deletions skills/codex-delegate/references/codex-prompt-blocks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,248 @@
# Codex prompt blocks — phrasing the task so Codex performs well

`task-template.md` defines the *shape of the brief file* (Context / Goal /
Constraints / Acceptance). This file defines the *phrasing layer*: a set of
composable XML-tagged blocks you drop **inside** the brief's Goal and
Constraints when the task is judgment-sensitive.

For a pure mechanical sweep (batch rename, docstring add) the flat template is
enough — do not bolt these blocks on. Reach for them when an unsupported guess,
a half-finished fix, or an unrelated refactor would actually hurt the result:
debugging, write-capable changes, review, research.

> Adapted from the `gpt-5-4-prompting` skill in
> [`openai/codex-plugin-cc`](https://github.com/openai/codex-plugin-cc)
> (Apache-2.0). Reframed for this skill's single-shot brief-file workflow.

## Where the blocks go

The wrapper runs `--prompt "Read .ai/codex_task_<name>.md and execute..."`, so
Codex reads the brief file. XML tags inside that markdown file are read fine.
Put the blocks under the brief's Goal (what done looks like) or Constraints
(how to stay safe). Keep them compact — a better contract beats raising the
reasoning effort or padding the prompt with prose.

Core rules:

- One clear task per Codex run. Split unrelated asks into separate runs.
- Tell Codex what *done* looks like; do not assume it infers the end state.
- Add a block only where the task needs it. Remove redundant ones before sending.
- Use the exact tag names below so briefs stay consistent across runs.

## Block library

Wrap each block in the XML tag shown. Pick the smallest set that fits.

### `task` — use in nearly every brief

```xml
<task>
The concrete job, the relevant repo or failure context, and the expected end state.
</task>
```

### `structured_output_contract` — when the response shape matters

```xml
<structured_output_contract>
Return exactly the requested output shape and nothing else.
Keep it compact. Put the highest-value findings or decisions first.
</structured_output_contract>
```

### `compact_output_contract` — concise prose instead of a schema

```xml
<compact_output_contract>
Keep the final answer compact and structured.
No long scene-setting, no repeated recap.
</compact_output_contract>
```

### `default_follow_through_policy` — when Codex should act, not ask

```xml
<default_follow_through_policy>
Default to the most reasonable low-risk interpretation and keep going.
Only stop to ask when a missing detail changes correctness, safety, or an
irreversible action.
</default_follow_through_policy>
```

### `completeness_contract` — multi-step work that must not stop early

```xml
<completeness_contract>
Resolve the task fully before stopping. Do not stop at the first plausible
answer. Check for follow-on fixes, edge cases, or cleanup needed for a
correct result.
</completeness_contract>
```

### `verification_loop` — when correctness matters

```xml
<verification_loop>
Before finalizing, verify the result against the task requirements and the
changed files or tool outputs. If a check fails, revise — do not report the
first draft.
</verification_loop>
```

### `missing_context_gating` — when Codex might otherwise guess

```xml
<missing_context_gating>
Do not guess missing repository facts. If required context is absent, retrieve
it with tools or state exactly what remains unknown.
</missing_context_gating>
```

### `grounding_rules` — review, research, root-cause analysis

```xml
<grounding_rules>
Ground every claim in the provided context or your tool outputs.
Do not present inferences as facts. Label any hypothesis clearly.
</grounding_rules>
```

### `action_safety` — write-capable or potentially broad tasks

```xml
<action_safety>
Keep changes tightly scoped to the stated task. Avoid unrelated refactors,
renames, or cleanup unless required for correctness. Call out any risky or
irreversible action before taking it.
</action_safety>
```

### `research_mode` — exploration, comparisons, recommendations

```xml
<research_mode>
Separate observed facts, reasoned inferences, and open questions.
Prefer breadth first, then go deeper only where evidence changes the answer.
</research_mode>
```

### `dig_deeper_nudge` — review and adversarial inspection

```xml
<dig_deeper_nudge>
After the first plausible issue, check for second-order failures, empty-state
behavior, retries, stale state, and rollback paths before finalizing.
</dig_deeper_nudge>
```

### `progress_updates` — long-running, tool-heavy runs

```xml
<progress_updates>
Keep progress updates brief and outcome-based. Mention only major phase
changes or blockers.
</progress_updates>
```

## Recipes

Copy the smallest recipe that fits, then trim. These slot into the brief's
Goal/Constraints; the brief still carries Context (file lists) and Acceptance
(verification commands) per `task-template.md`.

### Narrow fix

```xml
<task>
Implement the smallest safe fix for the identified issue. Preserve existing
behavior outside the failing path.
</task>
<structured_output_contract>
Return: 1. summary of the fix 2. touched files 3. verification performed
4. residual risks or follow-ups
</structured_output_contract>
<completeness_contract>
Resolve the task fully. Do not stop after identifying the issue without
applying the fix.
</completeness_contract>
<verification_loop>
Before finalizing, verify the fix matches the requirements and the changed
code is coherent.
</verification_loop>
<action_safety>
Keep changes tightly scoped. Avoid unrelated refactors or cleanup.
</action_safety>
```

### Diagnosis (read-only — pair with a read-only run)

```xml
<task>
Diagnose why the failing test or command breaks in this repository. Identify
the most likely root cause.
</task>
<compact_output_contract>
Return: 1. most likely root cause 2. evidence 3. smallest safe next step
</compact_output_contract>
<missing_context_gating>
Do not guess missing repository facts. State exactly what remains unknown.
</missing_context_gating>
<verification_loop>
Before finalizing, verify the proposed root cause matches the observed evidence.
</verification_loop>
```

### Root-cause review

```xml
<task>
Analyze this change for the most likely correctness or regression issues.
Focus on the provided repository context only.
</task>
<structured_output_contract>
Return: 1. findings ordered by severity 2. supporting evidence per finding
3. brief next steps
</structured_output_contract>
<grounding_rules>
Ground every claim in the repo context or tool outputs. Label inferences.
</grounding_rules>
<dig_deeper_nudge>
Check second-order failures, empty-state handling, retries, stale state, and
rollback paths before finalizing.
</dig_deeper_nudge>
```

### Research / recommendation

```xml
<task>
Research the available options and recommend the best path for this task.
</task>
<structured_output_contract>
Return: 1. observed facts 2. reasoned recommendation 3. tradeoffs
4. open questions
</structured_output_contract>
<research_mode>
Separate observed facts, reasoned inferences, and open questions.
</research_mode>
```

## Prompt anti-patterns

These are about *prompt phrasing*, distinct from the brief-file anti-patterns
in `task-template.md` (vague goals, missing scope fence).

| Anti-pattern | Bad | Fix |
|---|---|---|
| Vague task framing | "Take a look at this and let me know what you think." | Wrap a concrete job in `<task>`. |
| Missing output contract | "Investigate and report back." | Add `structured_output_contract` or `compact_output_contract`. |
| No follow-through default | "Debug this failure." (Codex stalls asking) | Add `default_follow_through_policy`. |
| Asking for more reasoning | "Think harder and be very smart." | Add `verification_loop` — a contract beats a pep talk. |
| Mixing unrelated jobs | "Review this diff, fix the bug, update docs, suggest a roadmap." | One job per run; split into separate runs. |
| Unsupported certainty | "Tell me exactly why production failed." | Add `grounding_rules` so inferences stay labeled. |

## Cross-references

- `task-template.md` — the brief-file shape these blocks slot into
- `patterns.md` — the five delegation shapes (which recipe pairs with which)
- `review-checklist.md` — Claude's acceptance gate after the run
14 changes: 14 additions & 0 deletions skills/codex-delegate/references/model-selection.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,20 @@ Both produced correct, runnable code. The semantic difference is style: `gpt-5.5
- Wall-time pressure (interactive iteration, TDD-style loops).
- Time-to-first-byte matters more than the final-byte quality.

## The `spark` shorthand

`gpt-5.3-codex-spark` is a low-latency, low-cost Codex model. There is no alias
resolution in the wrapper — it passes `--model` straight through — so the
shorthand is a documentation convention: when the user (or a brief) says
**`spark`**, invoke the wrapper with `--model gpt-5.3-codex-spark`
(`-Model gpt-5.3-codex-spark` in PowerShell).

Reach for `spark` when the task is trivially mechanical and latency dominates —
a fast first pass, a TDD-style loop, or a sweep where even `gpt-5.4` is more
model than the edit needs. For anything where idiomatic style or subtle
correctness matters, stay on the default. Confirm the exact name your CLI
exposes with `codex models`.

## How to A/B another task in your project

```bash
Expand Down
8 changes: 7 additions & 1 deletion skills/codex-delegate/references/patterns.md
Original file line number Diff line number Diff line change
Expand Up @@ -132,11 +132,17 @@ Steps:
2. Run review mode:

```bash
codex exec review --full-auto </dev/null
codex exec review --sandbox workspace-write </dev/null
```

3. Read the review output as a hint, not a verdict. Claude still owns the acceptance decision and runs verification.

**Do not auto-apply fixes from a review run.** After presenting the findings,
stop. Surface them to the user and ask which, if any, they want fixed before
touching a single file — even when a fix looks obvious. A review produces a
list of *candidate* issues, not an approved work order; turning it straight
into edits skips the user's prioritisation call.

Review mode is cheaper than re-running the full implementation pattern when you only want a sanity check.

---
Expand Down
8 changes: 8 additions & 0 deletions skills/codex-delegate/references/review-checklist.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,3 +34,11 @@ Before accepting a Codex run, Claude must verify each of these:
## Decision

If any answer is *no*, fix the task file or take the rest locally in Claude. Do not paper over a bad run.

## Review-mode runs (Pattern 5)

This checklist gates *acceptance of a delegated implementation*. A **review
mode** run (`patterns.md` Pattern 5) is different: it produces candidate
findings, not a diff to accept. After presenting those findings, stop — do not
auto-apply fixes. Ask the user which issues they want addressed before
touching any file, even when a fix looks obvious.
12 changes: 12 additions & 0 deletions skills/codex-delegate/references/task-template.md
Original file line number Diff line number Diff line change
Expand Up @@ -43,3 +43,15 @@ Save the brief at `.ai/codex_task_<name>.md`. If the task is part of a multi-age
- Missing scope fence (Codex will then "improve" unrelated files)
- Acceptance written as prose instead of executable verification
- Asking Codex to "decide" between alternatives instead of just executing

These are the failure modes of the *brief file*. The failure modes of the
*task phrasing* (missing output contract, "think harder", mixing unrelated
jobs) are separate — see `codex-prompt-blocks.md`.

## Judgment-sensitive tasks

For debugging, write-capable changes, review, or research, the flat template
above is not enough on its own. Add the composable XML prompt blocks
(`<verification_loop>`, `<grounding_rules>`, `<action_safety>`, …) from
`codex-prompt-blocks.md` to the Goal/Constraints. A pure mechanical sweep
does not need them.
Loading