WenyuChiou · May 10, 2026
diff --git a/skills/codex-delegate/SKILL.md b/skills/codex-delegate/SKILL.md
@@ -48,7 +48,7 @@ Full routing table and good/bad examples: `references/delegation-targets.md`.
 ## Compatibility
 
 - Tested with `@openai/codex` 0.128.0 (May 2026). Should work with any version that accepts `codex exec --sandbox workspace-write`.
-- Default model: `gpt-5.4` (override via `--model` or `-Model`). Other models on your CLI: see `codex models`.
+- Default model: `gpt-5.4` (override via `--model` or `-Model`). `gpt-5.5` is also available on `codex-cli` 0.128.0+ and produces more idiomatic output at ~3× the token cost; trade-offs and an A/B-test recipe live in `references/model-selection.md`. Other models on your CLI: see `codex models`.
 - Wrapper calls `codex exec --sandbox workspace-write -C <repo> -m <model>`. The older `--full-auto` flag is deprecated in 0.128+ and was replaced.
 - `codex exec` runs in non-interactive mode and auto-approves (no `--ask-for-approval` flag exists on `exec`; that flag is top-level only).
 - Direct `codex exec` calls must close stdin (`</dev/null`) to avoid the historical hang (issue #20919).
@@ -64,3 +64,4 @@ Full routing table and good/bad examples: `references/delegation-targets.md`.
 - `references/patterns.md` — five single-task delegation shapes (context file, parallel, resume, structured output, review mode)
 - `references/multi-agent.md` — leaf role in router/leaves architecture; when to route through `research-hub-multi-ai` or `agent-task-splitter`
 - `references/examples.md` — concrete invocation examples on `codex-cli` 0.128.0+ syntax
+- `references/model-selection.md` — `gpt-5.4` vs `gpt-5.5` trade-offs (A/B-tested) + when to override the default
diff --git a/skills/codex-delegate/references/model-selection.md b/skills/codex-delegate/references/model-selection.md
@@ -0,0 +1,55 @@
+# Model Selection
+
+The wrapper defaults to `-m gpt-5.4`. As of `codex-cli` 0.128.0, `gpt-5.5` is also available and works correctly. **The default has not been changed** because the choice is not free — see the trade-off table below — and the right choice depends on the task. Override per-call with `--model gpt-5.5` (bash wrapper) or `-Model gpt-5.5` (PowerShell), or pin a different default in `~/.codex/config.toml`:
+
+```toml
+# `model` is a top-level key in Codex's config.toml
+model = "gpt-5.5"
+```
+
+## Trade-off snapshot
+
+One A/B run: a single `codex-delegate` invocation per model, identical prompt, fresh repo on each side. Prompt: *"Write a Python function `fibonacci(n)` that returns the nth Fibonacci number using memoization. Include a 1-line docstring and a single inline comment explaining the base case. Output ONLY the function definition, no test code, no explanation."*
+
+| Metric | `gpt-5.4` | `gpt-5.5` |
+|---|---|---|
+| Wall time | 12.4 s | 15.7 s (+27%) |
+| Tokens used | 6,962 | 21,432 (×3.1) |
+| Output | uses a mutable default argument (`memo={0:0,1:1}`); works, but the dict is shared across calls, a known Python pitfall | closure with inner `_fib` and a fresh `memo = {}` per outer call; more idiomatic |
+| `status` in `result.json` | `success` | `success` |
+
+Both produced correct, runnable code. The difference is stylistic rather than functional: `gpt-5.5` produced more idiomatic Python at roughly 3× the token cost and 27% more wall time.
+
+## When to opt in to `gpt-5.5`
+
+- The output will be read by humans (production refactor, code-review prep, library code that ships).
+- Idiomatic style or subtle correctness (closure vs mutable default, generator vs list, dataclass vs dict, etc.) matters more than throughput.
+- The task is one-shot, not a sweep — the 3× token cost doesn't multiply across many calls.
+
+## When to keep `gpt-5.4` (current default)
+
+- Mechanical sweeps across many files (the wrapper's main use case): boilerplate, batch edits, scaffolding, test harness generation. Token cost compounds quickly across N calls, and `gpt-5.4` is sufficient for these.
+- Token budget pressure (you are running parallel delegate sessions or the user has quota concerns).
+- Wall-time pressure (interactive iteration, TDD-style loops).
+- Fast turnaround matters more than the polish of the final output.
+
+## How to A/B another task in your project
+
+```bash
+mkdir -p /tmp/codex-ab-test/{a,b}
+PROMPT="<your task>"
+
+bash ~/.claude/skills/codex-delegate/scripts/run_codex.sh \
+  --prompt "$PROMPT" --repo /tmp/codex-ab-test/a \
+  --log-file /tmp/codex-ab-test/a/log.txt --model gpt-5.4
+
+bash ~/.claude/skills/codex-delegate/scripts/run_codex.sh \
+  --prompt "$PROMPT" --repo /tmp/codex-ab-test/b \
+  --log-file /tmp/codex-ab-test/b/log.txt --model gpt-5.5
+
+diff /tmp/codex-ab-test/{a,b}/log.txt
+cat /tmp/codex-ab-test/a/log.txt.result.json
+cat /tmp/codex-ab-test/b/log.txt.result.json
+```
+
+Compare wall time, `tokens used`, and the generated code itself. Let the pattern you observe on a representative task drive the model choice for that kind of task, rather than the single-prompt snapshot above.
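
As a reading aid for the "Output" row of the trade-off table in `references/model-selection.md`, the two memoization shapes it describes might look roughly like the sketch below. This is an illustrative reconstruction from the table's wording, not the models' actual output. The function names `fibonacci_shared_memo` and `fibonacci_closure` are hypothetical and chosen only so both variants can live in one file.

```python
def fibonacci_shared_memo(n, memo={0: 0, 1: 1}):
    """Return the nth Fibonacci number (mutable-default variant)."""
    # The default dict is created once, at definition time, and is then
    # shared by every call that does not pass its own `memo`.
    if n not in memo:
        memo[n] = fibonacci_shared_memo(n - 1, memo) + fibonacci_shared_memo(n - 2, memo)
    return memo[n]


def fibonacci_closure(n):
    """Return the nth Fibonacci number (closure variant)."""
    memo = {}  # fresh cache for each outer call

    def _fib(k):
        if k < 2:  # base case: fib(0) = 0, fib(1) = 1
            return k
        if k not in memo:
            memo[k] = _fib(k - 1) + _fib(k - 2)
        return memo[k]

    return _fib(n)


if __name__ == "__main__":
    print(fibonacci_shared_memo(10), fibonacci_closure(10))
    # Size of the shared default dict after the call above.
    print(len(fibonacci_shared_memo.__defaults__[0]))
```

Both variants return the same values. The practical difference is that the first keeps its cache alive between calls through the shared default dict, which is harmless here but surprising if a caller ever mutates or relies on that state, while the second discards its cache when the outer call returns.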