## Context
Recent Ralph runs show that the expensive loop is also the reason output quality is high:
implementation + full reviewer loop + broad validation + workflow bookkeeping
The goal should not be to remove the loop. The goal should be to reduce avoidable re-review cycles by moving recurring reviewer objections earlier into the worker phase, while preserving independent review, re-review, validation, and receipts.
The observed high-value review findings were mostly not cosmetic. They caught false confidence:
- Tests that passed but did not hit the intended branch.
- Assertions that used broad `Contains` checks instead of exact behavior.
- Platform or version evidence that was claimed but not actually proven.
- Numeric precision cases that looked valid but were not exactly representable.
- Locale/calendar variants missed by the first implementation.
- Coverage tasks where uncovered branches needed either real stimuli or explicit unreachable proof.
## Findings From Recent Ralph Runs
This proposal comes from analyzing the last two Ralph runs on the Humanizer repo, not from a general desire to make the loop faster. The data shows that the long duration was mostly active quality work, and that the slowest cycles were exactly the ones where review found gaps between "tests exist" and "the intended behavior is actually proven."
### Run-Level Data
| Run | Scope | Iterations | Timeout events | Known turns | Instrumented active minutes* | Outcome |
|---|---|---|---|---|---|---|
| `20260412-220502-e6d1` | Finished most of Urdu locale epic, then started coverage epic | 25 | 3 | 1,252 | 790.5 | Hit `MAX_ITERATIONS` while task `.9` was still in progress |
| `20260413-182801-22c6` | Finished remaining coverage tasks and completion review, then started fn-10 plan review | 10 | 1 | 436 | 307.4 | Hit plan-review infra blocker because RP was unavailable |
| Total | Two Ralph runs | 35 | 4 | 1,688 | 1,097.9 min / 18.3 hr | Work mostly completed; remaining blocker was infrastructure/workflow |
\* Instrumented active minutes are from per-iteration `duration_ms` plus 60 minutes per worker timeout. Wall-clock elapsed was higher because run start/end spans include gaps and process overhead.
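The footnote's accounting rule is simple enough to sketch directly. This is illustrative only; the field names (`duration_ms`, `timeout`) are assumptions, not the actual Ralph iteration schema.

```python
# Sketch of the "instrumented active minutes" math: sum per-iteration
# duration_ms, counting each worker timeout as a flat 60 minutes.
# Field names here are assumed, not the real Ralph schema.
TIMEOUT_PENALTY_MIN = 60

def active_minutes(iterations):
    total = 0.0
    for it in iterations:
        if it.get("timeout"):
            total += TIMEOUT_PENALTY_MIN
        else:
            total += it["duration_ms"] / 60_000  # ms -> minutes
    return total

iters = [
    {"duration_ms": 1_890_000, "timeout": False},  # 31.5 min
    {"duration_ms": 3_120_000, "timeout": False},  # 52.0 min
    {"timeout": True},                             # counted as 60 min
]
print(round(active_minutes(iters), 1))  # 143.5
```

This also makes explicit why four timeouts alone account for roughly 240 minutes in the totals above.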
### Concentration of Cost
The cost was concentrated in a small number of high-turn iterations:
| Iteration | Task / phase | Turns | Minutes | What happened |
|---|---|---|---|---|
| Run 1 iter 8 | `fn-8.10` | 186 | 37.6 | Hijri calendar support needed review fixes for independent month emission, `UmAlQuraCalendar`, bidi validation, and exact per-month outputs. |
| Run 1 iter 11 | `fn-8.12` | 151 | 58.5 | Large Urdu locale validation surface; high turn count indicates broad code/test/review/validation sweep. |
| Run 1 iter 16 | `fn-8` completion review | 123 | 32.4 | Epic-level review verified the combined Urdu implementation rather than per-task slices. |
| Run 1 iter 19 | `fn-9.3` | 113 | 21.2 | Coverage-gap task with review/validation work. |
| Run 1 iter 4 | `fn-8.3` | 104 | 23.0 | Re-ran after timeout; review had found a real negative ordinal bug and missing high-scale/parse coverage. |
| Run 2 iter 5 | `fn-9.14` | 119 | 55.9 | WordsToNumber/TimeSpan branch coverage needed multiple fixes because some tests did not reach the claimed branches. |
| Run 2 iter 2 | `fn-9.10` | 75 | 52.6 | Core extension coverage needed exact output assertions and real error-branch tests. |
| Run 2 iter 9 | `fn-9` completion review | 73 | 21.2 | Completion review found workflow/spec evidence issues after all tasks were marked done. |
Across both runs:
- 14 iterations had >=50 turns, consuming 560.1 active minutes.
- 6 iterations had >=100 turns, consuming 228.7 active minutes.
- Median iteration duration was 31.5 minutes; p75 was 52.0 minutes.
- 4 worker timeouts consumed about 240 minutes by themselves.
This is the problem to optimize: not that review exists, but that the worker often entered review without a compact proof of what it thought it had proven. The reviewer then had to discover weak assertions, wrong branch stimuli, missing variant proof, and evidence overclaims late in the loop.
### What Later Cycles Actually Fixed
The extra cycles produced meaningful correctness improvements:
| Area | Later-cycle finding | Concrete correction | Why it matters |
|---|---|---|---|
| Urdu negative ordinals (`fn-8.3`) | Negative ordinal path appended suffixes before checking exact replacements. | Fixed negative ordinals to consult exact replacements first; added tests for `-1`, `-5`, high-scale cardinal output, and negative parse round-trip. | Real behavior bug. A green build and basic locale tests would not have caught it. |
| Hijri calendar (`fn-8.10`) | Tests covered Hijri conceptually but missed the runtime `UmAlQuraCalendar` variant. | Changed detection to include `UmAlQuraCalendar`; added exact-output month tests. | Users can hit this calendar implementation at runtime. Fragment tests were insufficient. |
| Urdu regional test matrix (`fn-8.6`) | Review found missing `101` feminine ordinal, string neuter fallback, bidi sweep gaps, and missing `UseCulture`. | Added targeted tests and culture binding. | This is locale correctness, not style. Missing culture binding can make tests pass for the wrong reason. |
| Coverage baseline (`fn-9.1`) | Local macOS artifacts were being treated too close to canonical baseline evidence and omitted `net48`. | Reworked artifact status/placement, added real baseline generation structure, compacted huge `uncovered.json`, documented local baseline as informational. | Prevented downstream tasks from using a misleading baseline. |
| Dual Roslyn analyzers (`fn-9.7`) | Tests could have validated the old analyzer path or stale assets while claiming Roslyn 4.14 coverage. | Added isolated intermediate outputs, explicit restore/build behavior, and package-version assertions. | This is a classic false-proof problem: tests pass but do not prove the intended compatibility arm. |
| Engine factory coverage (`fn-9.9`) | Initial tests did not cover enough factory dispatch/value-resolution/token-map branches. | Added scope assertions, reflection-based token-map tests, defensive exception tests, value-resolution tests, and broad factory dispatch coverage. | Branch-coverage task required a branch map, not just more tests. |
| ByteSize/Truncator (`fn-9.12`) | Test used `long.MaxValue` through `double`, which is not exactly representable. | Switched to `1L << 53`, renamed the test to state the precision contract, strengthened exact assertions. | Avoided a mathematically misleading test. |
| WordsToNumber/TimeSpan (`fn-9.14`) | Several tests did not exercise the claimed branches, including Filipino hyphenated vs glued token forms and misunderstood TimeSpan overflow/max-unit paths. | Added branch-accurate stimuli, exact TimeSpan expected strings, exception message checks, out-param checks, and a synthetic fixture for an otherwise unreachable `ShouldIgnore` path. | This is the strongest evidence for the proposal: the reviewer repeatedly had to reverse-engineer whether tests actually hit intended branches. |
| Coverage gate (`fn-9.11`) | Malformed/missing XML tests needed stronger assertions. | Strengthened assertions and task evidence. | Reduces the chance of a gate script silently accepting bad inputs. |
| Completion review (`fn-9`) | Epic-level review found task/state/spec evidence drift after per-task SHIP verdicts. | Synced task `.11` status and documented task `.6` null-guard scope amendment. | Confirms completion review is useful and should not be removed. |
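The ByteSize/Truncator finding deserves a concrete illustration, because it is a property of IEEE-754 doubles rather than of C#. The same behavior reproduces in any language with 64-bit floats:

```python
# 2**63 - 1 (long.MaxValue in C#) exceeds the 53-bit significand of an
# IEEE-754 double, so round-tripping it through float changes its value.
long_max = 2**63 - 1
assert int(float(long_max)) != long_max   # rounds up to 2**63
assert int(float(long_max)) == 2**63

# 2**53 is still exactly representable; 2**53 + 1 is the first integer
# that is not -- which is why the fixed test switched to 1L << 53.
safe = 1 << 53
assert int(float(safe)) == safe
assert int(float(safe + 1)) == safe       # rounds back down (ties to even)
```

A test built on `long.MaxValue` through `double` therefore asserts against a silently rounded value, which is exactly the "mathematically misleading" problem the review caught.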
### Failure / Churn Modes Observed
These are the specific recurring problems the Quality Proof Packet targets:
- Branch stimulus ambiguity: tests existed, but it was unclear or false that they reached the intended branch.
- Weak observable assertions: `Contains` or broad checks passed without proving exact behavior.
- Variant proof gaps: platform/calendar/Roslyn-version variants were claimed without direct evidence.
- Evidence overclaiming: local artifacts were treated as canonical or CI-equivalent before they were.
- Unreachable branch ambiguity: reviewer had to ask whether a branch was truly unreachable or merely untested.
- Reviewer reconstruction cost: every review had to reconstruct the worker's intended coverage map from code, tests, and prose.
- Workflow/infrastructure churn: missing receipts, session mismatch, and RP unavailability caused retries that did not improve code quality.
The current Ralph quality model is directionally right. The loop is slow because it keeps forcing claims to become proof. The opportunity is to make the worker present stronger proof before the first review, then let the reviewer verify that proof independently.
## Estimated Impact
These estimates assume no reduction in review strictness, no removal of completion review, and no weaker validation gates.
| Change | Expected time impact | Expected token impact | Why |
|---|---|---|---|
| Quality Proof Packet before first review | 10-20% overall runtime reduction on broad Ralph runs; 20-35% reduction on coverage/localization/analyzer-heavy tasks | 10-25% fewer review/fix-loop tokens on high-churn tasks | Avoids at least one reviewer-discovered weak-test/branch-proof cycle on tasks like `fn-9.12`, `fn-9.14`, `fn-9.7`, `fn-8.10`. |
| Branch/variant proof targets in task specs | 5-10% overall, higher on coverage epics | 5-15% fewer implementation/review tokens | Worker starts with explicit proof obligations instead of inferring them after review feedback. |
| Typed review categories | Small immediate runtime win; enables later policy | 5-10% on re-review prompts | Lets fix loops focus on blocking correctness/test/evidence categories and stop spending tokens on unstructured issue parsing. |
| Evidence/proof rendering in `flowctl done` | Small runtime win | Small token win | Completion review can inspect proof/evidence directly instead of reconstructing why a task was considered done. |
| Ralph `events.jsonl` analytics | No direct runtime win | No direct token win | Makes future bottleneck analysis automatic instead of forensic. |
| Workflow/session/receipt cleanup | 5-15% reduction in wasted retries when guard/session mismatches occur | 5-10% fewer retry tokens | Targets churn like missing receipt resets and manual receipt/state mismatch, without touching code-quality gates. |
Reasonable expectation for a Humanizer-sized run:
- From the observed 18.3 instrumented active hours across the two runs, a conservative 10-20% improvement is 1.8-3.7 hours saved.
- On high-churn iterations only: the 14 iterations with >=50 turns consumed 560 active minutes. A 20-35% reduction there is 1.9-3.3 hours saved.
- Token savings should mostly come from fewer repeated review/fix turns, not shorter first-pass implementation. For high-churn tasks, expect 10-25% fewer review-loop tokens. Across the full run, expect 5-15% fewer total tokens, because implementation, tests, and broad validation still remain necessary.
The estimate is intentionally bounded. This proposal should make hard tasks less repetitive, not cheap. The quality-producing loop stays intact.
## Proposal
Add a first-class Quality Proof Packet for each task.
The worker creates this artifact after implementation and targeted validation, before `/flow-next:impl-review`. The reviewer treats it as untrusted implementation claims and verifies it against code, diff, tests, and spec.
Example shape:
# Quality Proof: fn-9.14
## Scope
Implemented WordsToNumber and TimeSpan branch coverage for ...
## Spec Coverage
| Acceptance item | Evidence |
|---|---|
| Handles invalid ordinal abbreviation | Test X asserts parsedNumber=0 and unrecognized word |
## Branch / Variant Proof
| Target | Stimulus | Observable assertion | Why this hits target |
|---|---|---|---|
| Filipino teen prefix branch | "labingwalo" | returns 18 | glued form enters prefix branch; hyphenated form does not |
## Assertion Quality
- No behavior tests rely on broad `Contains` unless substring behavior is the contract.
- Error paths assert out params / exception type / exception message where relevant.
- Culture-sensitive outputs assert exact strings.
## Unreachable / Deferred
| Target | Reason |
|---|---|
| Greedy profile ShouldIgnore production path | No production profile currently configures ignored tokens; synthetic fixture added |
## Validation
- `dotnet test ... --filter ...` passed
- Coverage target improved from X to Y, or explicit unreachable proof supplied
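The packet's shape can also be checked mechanically before review. A minimal sketch follows; the required-heading list mirrors the example above, and enforcing it is part of this proposal, not current flowctl behavior:

```python
# Minimal structural check for a Quality Proof Packet. The heading set is
# taken from the example shape above; requiring these sections is a
# proposed convention, not an existing flowctl feature.
REQUIRED_SECTIONS = [
    "## Scope",
    "## Spec Coverage",
    "## Branch / Variant Proof",
    "## Assertion Quality",
    "## Validation",
]

def missing_sections(packet_text):
    present = {line.strip() for line in packet_text.splitlines()}
    return [s for s in REQUIRED_SECTIONS if s not in present]

packet = "# Quality Proof: fn-9.14\n## Scope\n...\n## Spec Coverage\n..."
print(missing_sections(packet))
# ['## Branch / Variant Proof', '## Assertion Quality', '## Validation']
```

A check like this would let the worker (or a guard hook) reject an incomplete packet before it ever reaches the reviewer.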
## Implementation Plan
### 1. Add proof guidance to planning
Update `plugins/flow-next/skills/flow-next-plan/steps.md`.
Generated task specs may include:
## Correctness proof targets
| Target | Required proof |
|---|---|
| Branch/variant | Stimulus + exact observable assertion |
| Platform/version | Evidence command or explicit boundary |
| Error path | Exception/out-param assertion |
| Unreachable | Short proof, not handwave |
This should be mandatory for coverage, localization, platform, analyzer, compatibility, and broad refactor tasks. It can be omitted for simple tasks.
### 2. Add a worker phase before review
Update `plugins/flow-next/agents/worker.md`.
After implementation and targeted tests, create `/tmp/flow-quality-proof-$TASK_ID.md`.
For branch/coverage tasks, require at least one proof row per target branch or a clear unreachable row.
For tests, require assertion-quality notes.
For variant/platform work, require one of:
- proved locally
- proved by CI
- not provable on this host, with explicit follow-up/evidence boundary
Then change the review invocation from `/flow-next:impl-review <TASK_ID> --base $BASE_COMMIT` to `/flow-next:impl-review <TASK_ID> --base $BASE_COMMIT --proof /tmp/flow-quality-proof-$TASK_ID.md`.
### 3. Extend impl-review argument parsing
Update `plugins/flow-next/skills/flow-next-impl-review/SKILL.md` and `plugins/flow-next/skills/flow-next-impl-review/workflow.md` to parse `--proof <path>`.
For the Codex backend, pass it through: `flowctl codex impl-review "$TASK_ID" --base "$DIFF_BASE" --proof "$PROOF_PATH" --receipt "$RECEIPT_PATH"`
For the RP backend, read the proof file and embed it in the prompt under an `## Implementer's Correctness Proof` heading.
The prompt must instruct:
> Treat this proof as untrusted implementation claims. Verify each claim against code, tests, diff, and spec. Flag any claim that is missing, weak, or not actually exercised.
### 4. Extend `flowctl codex impl-review`
Update `plugins/flow-next/scripts/flowctl.py` to add:
- a `--proof` argument to `codex impl-review`
- a helper like `read_optional_review_artifact(path, max_bytes=100000)`
- a `quality_proof` parameter to `build_review_prompt`
- a `<quality_proof>...</quality_proof>` section rendered before `<review_instructions>`
Keep RP and Codex review semantics aligned.
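A sketch of what the helper and prompt section could look like. Names follow the proposal; the surrounding `build_review_prompt` code in `flowctl.py` is assumed, not reproduced:

```python
import os

# Proposed helper: read a review artifact if present, truncating defensively
# so an oversized proof file cannot blow up the prompt.
def read_optional_review_artifact(path, max_bytes=100000):
    if not path or not os.path.isfile(path):
        return None
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        text = f.read(max_bytes)
    return text or None

# Proposed prompt wiring: emit the section only when a proof was supplied,
# so runs without --proof produce byte-identical prompts to today's.
def quality_proof_section(proof_text):
    if not proof_text:
        return ""
    return f"<quality_proof>\n{proof_text}\n</quality_proof>\n"
```

Gating the section on the proof's presence keeps the change strictly additive for existing callers.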
### 5. Tighten review output categories
Update RP and Codex implementation review prompts to require each finding to include:
- Severity: Critical / Major / Minor / Nitpick
- Category: correctness / test-strength / evidence / scope / docs / style / infrastructure / unreachable-proof
- Blocks verdict: yes/no
- File:Line:
- Problem:
- Suggestion:
Do not initially build control flow around categories. First make them visible and consistent. Later Ralph can use them for analytics or different retry policy.
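To show what "visible and consistent" buys later, here is a hypothetical structured form of the finding format. The dataclass is illustrative only, not existing flowctl code; the field set mirrors the list above:

```python
from dataclasses import dataclass

# Hypothetical structured form of the typed finding format. Making findings
# data instead of free prose is what later enables analytics and
# category-aware retry policy.
@dataclass
class Finding:
    severity: str        # Critical / Major / Minor / Nitpick
    category: str        # correctness / test-strength / evidence / ...
    blocks_verdict: bool
    location: str        # File:Line
    problem: str
    suggestion: str

def blocking(findings):
    """Findings a fix loop must address before re-review."""
    return [f for f in findings if f.blocks_verdict]

findings = [
    Finding("Major", "test-strength", True, "WordsToNumberTests.cs:120",
            "Test does not reach the hyphenated-token branch.",
            "Use a stimulus that actually enters that branch."),
    Finding("Nitpick", "style", False, "Foo.cs:10",
            "Naming could be clearer.", "Rename local variable."),
]
print(len(blocking(findings)))  # 1
```

With this shape, a future policy could re-review only when `blocking(...)` is non-empty while batching nitpicks.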
### 6. Put proof into task evidence
Update the `worker.md` evidence JSON shape to allow:
{
"commits": ["..."],
"tests": ["..."],
"prs": [],
"review": {"mode": "rp", "verdict": "SHIP"},
"quality_proof": {
"path": "/tmp/flow-quality-proof-fn-9.14.md",
"branch_targets": 6,
"unreachable_claims": 1
}
}
Update `cmd_done` in `plugins/flow-next/scripts/flowctl.py` so `## Evidence` renders optional `Review:` and `Quality proof:` lines while preserving the existing commits/tests/prs behavior.
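A sketch of that rendering change. The real `cmd_done` renderer is not reproduced here; label wording and layout are assumptions:

```python
# Sketch: render optional review/quality_proof evidence alongside the
# existing commits/tests/prs lines. Label wording is an assumption, not
# the actual cmd_done output format.
def render_evidence(ev):
    lines = ["## Evidence"]
    for key in ("commits", "tests", "prs"):
        for item in ev.get(key, []):
            lines.append(f"- {key}: {item}")
    review = ev.get("review")
    if review:
        lines.append(f"- Review: {review.get('mode')} -> {review.get('verdict')}")
    qp = ev.get("quality_proof")
    if qp:
        lines.append(
            f"- Quality proof: {qp['path']} "
            f"({qp.get('branch_targets', 0)} branch targets, "
            f"{qp.get('unreachable_claims', 0)} unreachable)"
        )
    return "\n".join(lines)

evidence = {
    "commits": ["abc123"],
    "tests": ["WordsToNumberTests"],
    "prs": [],
    "review": {"mode": "rp", "verdict": "SHIP"},
    "quality_proof": {
        "path": "/tmp/flow-quality-proof-fn-9.14.md",
        "branch_targets": 6,
        "unreachable_claims": 1,
    },
}
print(render_evidence(evidence))
```

Evidence JSON without the new keys renders exactly as before, which keeps the change backward-compatible.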
### 7. Preserve full gates, but move broad validation to the right boundary
Update worker guidance:
- Before first review: run targeted tests and quick commands relevant to the task.
- After reviewer fixes: run targeted tests for changed surfaces.
- Before `flowctl done`: run task-level required validation.
- At epic completion: run broad validation and completion review.
Do not remove broad validation. The improvement is avoiding full-suite repetition after every small review edit unless the changed surface justifies it.
### 8. Improve memory capture from review findings
Update `plugins/flow-next/scripts/hooks/ralph-guard.py` and `plugins/flow-next/agents/worker.md`.
When review returns `NEEDS_WORK` or `MAJOR_RETHINK`, encourage generalized lessons only, categorized to match the review taxonomy.
Examples:
- `pitfall`: Branch coverage tests must state the exact stimulus that reaches the branch.
- `pitfall`: Avoid `Assert.Contains` for localization output unless substring behavior is the contract.
- `convention`: Analyzer compatibility tests must assert the loaded package version.

Then require relevant memory entries to appear in the next Quality Proof Packet under "Relevant memory applied".
### 9. Add Ralph analytics without changing behavior
Update `plugins/flow-next/skills/flow-next-ralph-init/templates/ralph.sh` to append a lightweight `events.jsonl` entry per iteration:
{"iter":7,"phase":"work","id":"fn-9.14","started_at":"...","ended_at":"...","rc":0,"timeout":false,"verdict":"SHIP","receipt_ok":true,"task_status":"done"}
This makes future runtime analysis mechanical instead of forensic.
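Once `events.jsonl` exists, per-run summaries become a few lines of scripting rather than log forensics. A sketch, with field names following the example event above:

```python
import json

# Sketch: summarize an events.jsonl stream into iteration counts, timeout
# counts, and per-verdict tallies. Field names follow the example event
# line above; the summary shape is illustrative.
def summarize(lines):
    events = [json.loads(ln) for ln in lines if ln.strip()]
    by_verdict = {}
    timeouts = 0
    for ev in events:
        v = ev.get("verdict")
        by_verdict[v] = by_verdict.get(v, 0) + 1
        if ev.get("timeout"):
            timeouts += 1
    return {"iterations": len(events), "timeouts": timeouts,
            "verdicts": by_verdict}

sample = [
    '{"iter":7,"phase":"work","id":"fn-9.14","rc":0,"timeout":false,"verdict":"SHIP"}',
    '{"iter":8,"phase":"work","id":"fn-9.15","rc":1,"timeout":true,"verdict":"NEEDS_WORK"}',
]
print(summarize(sample))
```

This is the kind of analysis that had to be reconstructed by hand for the two runs discussed above.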
## Tests
Update `plugins/flow-next/scripts/smoke_test.sh`:
- `build_review_prompt` includes `<quality_proof>`.
- Proof text is marked untrusted.
- `codex impl-review --help` includes `--proof`.
- `cmd_done` renders optional review and quality_proof evidence fields.
Update `plugins/flow-next/scripts/ci_test.sh`:
- Evidence JSON with extra keys remains accepted.
- Rendered evidence includes commits/tests/prs plus optional review/proof.
- Existing evidence JSON remains backward-compatible.
Update `plugins/flow-next/scripts/ralph_smoke_test.sh`:
- Stub worker writes a proof file.
- Iter log includes `/flow-next:impl-review ... --proof`.
- Receipts still pass existing schema.
RP e2e tests should not need semantic reviewer changes, only confirmation that the prompt path still works and receipt gates remain intact.
## Non-goals
- Do not remove independent review.
- Do not reduce mandatory re-review for substantive fixes.
- Do not weaken completion review.
- Do not make receipt validation optional.
- Do not lower validation expectations.
## Expected Effect
This should not make hard tasks cheap. The expensive loop remains because it is the quality mechanism.
The expected win is fewer avoidable review rounds:
- Fewer "test does not hit intended branch" findings.
- Fewer weak assertion findings.
- Fewer platform/version overclaim findings.
- Fewer reviewer cycles spent reconstructing what the worker thought it proved.
- Better future behavior as generalized review lessons accumulate in memory.