Reduce Ralph review churn with task quality proof packets #111

@clairernovotny

Description


Context

Recent Ralph runs show that the expensive loop (implementation + full reviewer loop + broad validation + workflow bookkeeping) is also the reason output quality is high.

The goal should not be to remove the loop. The goal should be to reduce avoidable re-review cycles by moving recurring reviewer objections earlier into the worker phase, while preserving independent review, re-review, validation, and receipts.

The observed high-value review findings were mostly not cosmetic. They caught false confidence:

  • Tests that passed but did not hit the intended branch.
  • Assertions that used broad Contains checks instead of exact behavior.
  • Platform or version evidence that was claimed but not actually proven.
  • Numeric precision cases that looked valid but were not exactly representable.
  • Locale/calendar variants missed by the first implementation.
  • Coverage tasks where uncovered branches needed either real stimuli or explicit unreachable proof.

Findings From Recent Ralph Runs

This proposal comes from analyzing the last two Ralph runs on the Humanizer repo, not from a general desire to make the loop faster. The data shows that the long duration was mostly active quality work, and that the slowest cycles were exactly the ones where review found gaps between "tests exist" and "the intended behavior is actually proven."

Run-Level Data

| Run | Scope | Iterations | Timeout events | Known turns | Instrumented active minutes* | Outcome |
|---|---|---|---|---|---|---|
| 20260412-220502-e6d1 | Finished most of Urdu locale epic, then started coverage epic | 25 | 3 | 1,252 | 790.5 | Hit MAX_ITERATIONS while task .9 was still in progress |
| 20260413-182801-22c6 | Finished remaining coverage tasks and completion review, then started fn-10 plan review | 10 | 1 | 436 | 307.4 | Hit plan-review infra blocker because RP was unavailable |
| Total | Two Ralph runs | 35 | 4 | 1,688 | 1,097.9 min / 18.3 hr | Work mostly completed; remaining blocker was infrastructure/workflow |

* Instrumented active minutes are from per-iteration duration_ms plus 60 minutes for worker timeouts. Wall-clock elapsed was higher because run start/end spans include gaps and process overhead.

Concentration of Cost

The cost was concentrated in a small number of high-turn iterations:

| Iteration | Task / phase | Turns | Minutes | What happened |
|---|---|---|---|---|
| Run 1 iter 8 | fn-8.10 | 186 | 37.6 | Hijri calendar support needed review fixes for independent month emission, UmAlQuraCalendar, bidi validation, and exact per-month outputs. |
| Run 1 iter 11 | fn-8.12 | 151 | 58.5 | Large Urdu locale validation surface; high turn count indicates broad code/test/review/validation sweep. |
| Run 1 iter 16 | fn-8 completion review | 123 | 32.4 | Epic-level review verified the combined Urdu implementation rather than per-task slices. |
| Run 1 iter 19 | fn-9.3 | 113 | 21.2 | Coverage-gap task with review/validation work. |
| Run 1 iter 4 | fn-8.3 | 104 | 23.0 | Re-ran after timeout; review had found a real negative ordinal bug and missing high-scale/parse coverage. |
| Run 2 iter 5 | fn-9.14 | 119 | 55.9 | WordsToNumber/TimeSpan branch coverage needed multiple fixes because some tests did not reach the claimed branches. |
| Run 2 iter 2 | fn-9.10 | 75 | 52.6 | Core extension coverage needed exact output assertions and real error-branch tests. |
| Run 2 iter 9 | fn-9 completion review | 73 | 21.2 | Completion review found workflow/spec evidence issues after all tasks were marked done. |

Across both runs:

  • 14 iterations had >=50 turns, consuming 560.1 active minutes.
  • 6 iterations had >=100 turns, consuming 228.7 active minutes.
  • Median iteration duration was 31.5 minutes; p75 was 52.0 minutes.
  • 4 worker timeouts consumed about 240 minutes by themselves.

This is the problem to optimize: not that review exists, but that the worker often entered review without a compact proof of what it thought it had proven. The reviewer then had to discover weak assertions, wrong branch stimuli, missing variant proof, and evidence overclaims late in the loop.

What Later Cycles Actually Fixed

The extra cycles produced meaningful correctness improvements:

| Area | Later-cycle finding | Concrete correction | Why it matters |
|---|---|---|---|
| Urdu negative ordinals (fn-8.3) | Negative ordinal path appended suffixes before checking exact replacements. | Fixed negative ordinals to consult exact replacements first; added tests for -1, -5, high-scale cardinal output, and negative parse round-trip. | Real behavior bug. A green build and basic locale tests would not have caught it. |
| Hijri calendar (fn-8.10) | Tests covered Hijri conceptually but missed the runtime UmAlQuraCalendar variant. | Changed detection to include UmAlQuraCalendar; added exact-output month tests. | Users can hit this calendar implementation at runtime. Fragment tests were insufficient. |
| Urdu regional test matrix (fn-8.6) | Review found missing 101 feminine ordinal, string neuter fallback, bidi sweep gaps, and missing UseCulture. | Added targeted tests and culture binding. | This is locale correctness, not style. Missing culture binding can make tests pass for the wrong reason. |
| Coverage baseline (fn-9.1) | Local macOS artifacts were being treated too close to canonical baseline evidence and omitted net48. | Reworked artifact status/placement, added real baseline generation structure, compacted huge uncovered.json, documented local baseline as informational. | Prevented downstream tasks from using a misleading baseline. |
| Dual Roslyn analyzers (fn-9.7) | Tests could have validated the old analyzer path or stale assets while claiming Roslyn 4.14 coverage. | Added isolated intermediate outputs, explicit restore/build behavior, and package-version assertions. | This is a classic false-proof problem: tests pass but do not prove the intended compatibility arm. |
| Engine factory coverage (fn-9.9) | Initial tests did not cover enough factory dispatch/value-resolution/token-map branches. | Added scope assertions, reflection-based token-map tests, defensive exception tests, value-resolution tests, and broad factory dispatch coverage. | Branch-coverage task required a branch map, not just more tests. |
| ByteSize/Truncator (fn-9.12) | Test used long.MaxValue through double, which is not exactly representable. | Switched to 1L << 53, renamed the test to state the precision contract, strengthened exact assertions. | Avoided a mathematically misleading test. |
| WordsToNumber/TimeSpan (fn-9.14) | Several tests did not exercise the claimed branches, including Filipino hyphenated vs glued token forms and misunderstood TimeSpan overflow/max-unit paths. | Added branch-accurate stimuli, exact TimeSpan expected strings, exception message checks, out-param checks, and a synthetic fixture for an otherwise unreachable ShouldIgnore path. | This is the strongest evidence for the proposal: the reviewer repeatedly had to reverse-engineer whether tests actually hit intended branches. |
| Coverage gate (fn-9.11) | Malformed/missing XML tests needed stronger assertions. | Strengthened assertions and task evidence. | Reduces chance of a gate script silently accepting bad inputs. |
| Completion review (fn-9) | Epic-level review found task/state/spec evidence drift after per-task SHIP verdicts. | Synced task .11 status and documented task .6 null-guard scope amendment. | Confirms completion review is useful and should not be removed. |
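The fn-9.12 row deserves a concrete illustration. A short Python sketch (Python floats are IEEE 754 binary64, the same as C#'s double) shows why long.MaxValue silently loses precision through a double while 1L << 53 does not:

```python
# Demonstration of the fn-9.12 precision pitfall.

long_max = 2**63 - 1  # long.MaxValue

# Round-tripping long.MaxValue through a double silently changes it:
# the nearest representable double is 2**63 itself.
assert int(float(long_max)) != long_max
assert int(float(long_max)) == 2**63

# 2**53 is the boundary below which every integer is exactly
# representable as a double, so 1L << 53 round-trips losslessly:
exact = 1 << 53
assert int(float(exact)) == exact

# The very next integer is the first one that is lost.
assert int(float(exact + 1)) != exact + 1
```

A test built on `long_max` would therefore have asserted behavior for a value the code under test never actually received, which is exactly the "mathematically misleading" problem the review caught.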

Failure / Churn Modes Observed

These are the specific recurring problems the Quality Proof Packet targets:

  1. Branch stimulus ambiguity: tests existed, but it was unclear or false that they reached the intended branch.
  2. Weak observable assertions: Contains or broad checks passed without proving exact behavior.
  3. Variant proof gaps: platform/calendar/Roslyn-version variants were claimed without direct evidence.
  4. Evidence overclaiming: local artifacts were treated as canonical or CI-equivalent before they were.
  5. Unreachable branch ambiguity: reviewer had to ask whether a branch was truly unreachable or merely untested.
  6. Reviewer reconstruction cost: every review had to reconstruct the worker's intended coverage map from code, tests, and prose.
  7. Workflow/infrastructure churn: missing receipts, session mismatch, and RP unavailability caused retries that did not improve code quality.

The current Ralph quality model is directionally right. The loop is slow because it keeps forcing claims to become proof. The opportunity is to make the worker present stronger proof before the first review, then let the reviewer verify that proof independently.

Estimated Impact

These estimates assume no reduction in review strictness, no removal of completion review, and no weaker validation gates.

| Change | Expected time impact | Expected token impact | Why |
|---|---|---|---|
| Quality Proof Packet before first review | 10-20% overall runtime reduction on broad Ralph runs; 20-35% reduction on coverage/localization/analyzer-heavy tasks | 10-25% fewer review/fix-loop tokens on high-churn tasks | Avoids at least one reviewer-discovered weak-test/branch-proof cycle on tasks like fn-9.12, fn-9.14, fn-9.7, fn-8.10. |
| Branch/variant proof targets in task specs | 5-10% overall, higher on coverage epics | 5-15% fewer implementation/review tokens | Worker starts with explicit proof obligations instead of inferring them after review feedback. |
| Typed review categories | Small immediate runtime win; enables later policy | 5-10% on re-review prompts | Lets fix loops focus on blocking correctness/test/evidence categories and stop spending tokens on unstructured issue parsing. |
| Evidence/proof rendering in flowctl done | Small runtime win | Small token win | Completion review can inspect proof/evidence directly instead of reconstructing why a task was considered done. |
| Ralph events.jsonl analytics | No direct runtime win | No direct token win | Makes future bottleneck analysis automatic instead of forensic. |
| Workflow/session/receipt cleanup | 5-15% reduction in wasted retries when guard/session mismatches occur | 5-10% fewer retry tokens | Targets churn like missing receipt resets and manual receipt/state mismatch, without touching code-quality gates. |

Reasonable expectation for a Humanizer-sized run:

  • From the observed 18.3 instrumented active hours across the two runs, a conservative 10-20% improvement is 1.8-3.7 hours saved.
  • On high-churn iterations only, observed 560 active minutes across iterations with >=50 turns. A 20-35% reduction there is 1.9-3.3 hours saved.
  • Token savings should mostly come from fewer repeated review/fix turns, not shorter first-pass implementation. For high-churn tasks, expect 10-25% fewer review-loop tokens. Across the full run, expect 5-15% fewer total tokens, because implementation, tests, and broad validation still remain necessary.
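The hour figures above follow directly from the instrumented run data; a minimal Python sketch reproduces the arithmetic:

```python
# Reproduce the savings bounds from the instrumented run data above.
total_active_hours = 1097.9 / 60   # ~18.3 hr across both runs
high_churn_minutes = 560.1         # iterations with >= 50 turns

# Conservative 10-20% overall improvement:
overall_saved = [round(total_active_hours * p, 1) for p in (0.10, 0.20)]
print(overall_saved)               # [1.8, 3.7] hours

# 20-35% reduction on high-churn iterations only:
churn_saved = [round(high_churn_minutes * p / 60, 1) for p in (0.20, 0.35)]
print(churn_saved)                 # [1.9, 3.3] hours
```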

The estimate is intentionally bounded. This proposal should make hard tasks less repetitive, not cheap. The quality-producing loop stays intact.

Proposal

Add a first-class Quality Proof Packet for each task.

The worker creates this artifact after implementation and targeted validation, before /flow-next:impl-review. The reviewer treats it as untrusted implementation claims and verifies it against code, diff, tests, and spec.

Example shape:

```markdown
# Quality Proof: fn-9.14

## Scope
Implemented WordsToNumber and TimeSpan branch coverage for ...

## Spec Coverage
| Acceptance item | Evidence |
|---|---|
| Handles invalid ordinal abbreviation | Test X asserts parsedNumber=0 and unrecognized word |

## Branch / Variant Proof
| Target | Stimulus | Observable assertion | Why this hits target |
|---|---|---|---|
| Filipino teen prefix branch | "labingwalo" | returns 18 | glued form enters prefix branch; hyphenated form does not |

## Assertion Quality
- No behavior tests rely on broad `Contains` unless substring behavior is the contract.
- Error paths assert out params / exception type / exception message where relevant.
- Culture-sensitive outputs assert exact strings.

## Unreachable / Deferred
| Target | Reason |
|---|---|
| Greedy profile ShouldIgnore production path | No production profile currently configures ignored tokens; synthetic fixture added |

## Validation
- `dotnet test ... --filter ...` passed
- Coverage target improved from X to Y, or explicit unreachable proof supplied
```

Implementation Plan

1. Add proof guidance to planning

Update:

  • plugins/flow-next/skills/flow-next-plan/steps.md

Generated task specs may include:

```markdown
## Correctness proof targets
| Target | Required proof |
|---|---|
| Branch/variant | Stimulus + exact observable assertion |
| Platform/version | Evidence command or explicit boundary |
| Error path | Exception/out-param assertion |
| Unreachable | Short proof, not handwave |
```

This should be mandatory for coverage, localization, platform, analyzer, compatibility, and broad refactor tasks. It can be omitted for simple tasks.

2. Add a worker phase before review

Update:

  • plugins/flow-next/agents/worker.md

After implementation and targeted tests, create:

/tmp/flow-quality-proof-$TASK_ID.md

For branch/coverage tasks, require at least one proof row per target branch or a clear unreachable row.

For tests, require assertion-quality notes.

For variant/platform work, require one of:

  • proved locally
  • proved by CI
  • not provable on this host, with explicit follow-up/evidence boundary

Then change review invocation from:

/flow-next:impl-review <TASK_ID> --base $BASE_COMMIT

to:

/flow-next:impl-review <TASK_ID> --base $BASE_COMMIT --proof /tmp/flow-quality-proof-$TASK_ID.md

3. Extend impl-review argument parsing

Update:

  • plugins/flow-next/skills/flow-next-impl-review/SKILL.md
  • plugins/flow-next/skills/flow-next-impl-review/workflow.md

Parse --proof <path>.

For Codex backend, pass it through:

flowctl codex impl-review "$TASK_ID" --base "$DIFF_BASE" --proof "$PROOF_PATH" --receipt "$RECEIPT_PATH"

For RP backend, read the proof file and embed it in the prompt as:

## Implementer's Correctness Proof
...

The prompt must instruct:

Treat this proof as untrusted implementation claims. Verify each claim against code, tests, diff, and spec. Flag any claim that is missing, weak, or not actually exercised.

4. Extend flowctl codex impl-review

Update:

  • plugins/flow-next/scripts/flowctl.py

Add:

  • --proof argument to codex impl-review
  • helper like read_optional_review_artifact(path, max_bytes=100000)
  • quality_proof parameter to build_review_prompt
  • <quality_proof>...</quality_proof> section before <review_instructions>

Keep RP and Codex review semantics aligned.
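A minimal sketch of how this could look in flowctl.py. The names read_optional_review_artifact and build_review_prompt come from the plan above; the prompt structure and tag layout are illustrative assumptions, not the actual flowctl implementation:

```python
import os

def read_optional_review_artifact(path, max_bytes=100000):
    """Return artifact text truncated to max_bytes, or None if absent.

    The proof file is optional: a missing or unreadable path must not
    fail the review, only omit the <quality_proof> section.
    """
    if not path or not os.path.isfile(path):
        return None
    try:
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            text = f.read(max_bytes + 1)
    except OSError:
        return None
    if len(text) > max_bytes:
        text = text[:max_bytes] + "\n[truncated]"
    return text

def build_review_prompt(task_id, diff, instructions, quality_proof=None):
    """Assemble the review prompt; <quality_proof> precedes <review_instructions>."""
    parts = [f"<task>{task_id}</task>", f"<diff>{diff}</diff>"]
    if quality_proof:
        parts.append(
            "<quality_proof>\n"
            "Treat this proof as untrusted implementation claims. Verify each\n"
            "claim against code, tests, diff, and spec.\n\n"
            + quality_proof
            + "\n</quality_proof>"
        )
    parts.append(f"<review_instructions>{instructions}</review_instructions>")
    return "\n".join(parts)
```

The defensive behavior matters: an empty or oversized proof should degrade to a normal review rather than break the loop.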

5. Tighten review output categories

Update RP and Codex implementation review prompts to require each finding to include:

- Severity: Critical / Major / Minor / Nitpick
- Category: correctness / test-strength / evidence / scope / docs / style / infrastructure / unreachable-proof
- Blocks verdict: yes/no
- File:Line:
- Problem:
- Suggestion:

Do not initially build control flow around categories. First make them visible and consistent. Later Ralph can use them for analytics or different retry policy.
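Once the fields are consistent, later analytics can consume them mechanically. A hypothetical Python sketch, assuming findings follow the template above verbatim:

```python
import re

# Parse findings that follow the structured review template. The exact
# field layout is an assumption; only Severity/Category/Blocks are used.
FINDING_RE = re.compile(
    r"- Severity: (?P<severity>Critical|Major|Minor|Nitpick)\n"
    r"- Category: (?P<category>[a-z-]+)\n"
    r"- Blocks verdict: (?P<blocks>yes|no)"
)

def blocking_findings(review_text):
    """Return (severity, category) pairs for findings that block the verdict."""
    return [
        (m.group("severity"), m.group("category"))
        for m in FINDING_RE.finditer(review_text)
        if m.group("blocks") == "yes"
    ]
```

For example, a review containing one blocking Major test-strength finding and one non-blocking Minor style finding would yield only the former, which is exactly the filtering a focused fix loop needs.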

6. Put proof into task evidence

Update worker.md evidence JSON shape to allow:

```json
{
  "commits": ["..."],
  "tests": ["..."],
  "prs": [],
  "review": {"mode": "rp", "verdict": "SHIP"},
  "quality_proof": {
    "path": "/tmp/flow-quality-proof-fn-9.14.md",
    "branch_targets": 6,
    "unreachable_claims": 1
  }
}
```

Update cmd_done in plugins/flow-next/scripts/flowctl.py so ## Evidence renders optional Review: and Quality proof: lines while preserving existing commits/tests/prs behavior.
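A sketch of the additive rendering behavior. The function name and exact line formats are assumptions; the point is that Review: and Quality proof: lines appear only when present, and unknown keys stay harmless:

```python
def render_evidence(evidence):
    """Render an ## Evidence block; review/quality_proof lines are optional."""
    lines = ["## Evidence"]
    if evidence.get("commits"):
        lines.append("Commits: " + ", ".join(evidence["commits"]))
    if evidence.get("tests"):
        lines.append("Tests: " + ", ".join(evidence["tests"]))
    if evidence.get("prs"):
        lines.append("PRs: " + ", ".join(str(p) for p in evidence["prs"]))
    review = evidence.get("review")
    if review:
        lines.append(f"Review: {review['mode']} -> {review['verdict']}")
    proof = evidence.get("quality_proof")
    if proof:
        lines.append(
            f"Quality proof: {proof['path']} "
            f"({proof.get('branch_targets', 0)} branch targets, "
            f"{proof.get('unreachable_claims', 0)} unreachable claims)"
        )
    return "\n".join(lines)
```

Because every field is rendered conditionally, existing evidence JSON without the new keys produces exactly the output it does today, preserving backward compatibility.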

7. Preserve full gates, but move broad validation to the right boundary

Update worker guidance:

  • Before first review: run targeted tests and quick commands relevant to the task.
  • After reviewer fixes: run targeted tests for changed surfaces.
  • Before flowctl done: run task-level required validation.
  • At epic completion: run broad validation and completion review.

Do not remove broad validation. The improvement is avoiding full-suite repetition after every small review edit unless the changed surface justifies it.

8. Improve memory capture from review findings

Update:

  • plugins/flow-next/scripts/hooks/ralph-guard.py
  • plugins/flow-next/agents/worker.md

When review returns NEEDS_WORK or MAJOR_RETHINK, encourage generalized lessons only, categorized to match the review taxonomy.

Examples:

  • pitfall: Branch coverage tests must state the exact stimulus that reaches the branch.
  • pitfall: Avoid Assert.Contains for localization output unless substring behavior is the contract.
  • convention: Analyzer compatibility tests must assert loaded package version.

Then require relevant memory entries to appear in the next Quality Proof Packet under a "Relevant memory applied" section.

9. Add Ralph analytics without changing behavior

Update:

  • plugins/flow-next/skills/flow-next-ralph-init/templates/ralph.sh

Append a lightweight events.jsonl per iteration:

{"iter":7,"phase":"work","id":"fn-9.14","started_at":"...","ended_at":"...","rc":0,"timeout":false,"verdict":"SHIP","receipt_ok":true,"task_status":"done"}

This makes future runtime analysis mechanical instead of forensic.

Tests

Update plugins/flow-next/scripts/smoke_test.sh:

  • build_review_prompt includes <quality_proof>.
  • Proof text is marked untrusted.
  • codex impl-review --help includes --proof.
  • cmd_done renders optional review and quality_proof evidence fields.

Update plugins/flow-next/scripts/ci_test.sh:

  • Evidence JSON with extra keys remains accepted.
  • Rendered evidence includes commits/tests/prs plus optional review/proof.
  • Existing evidence JSON remains backward-compatible.

Update plugins/flow-next/scripts/ralph_smoke_test.sh:

  • Stub worker writes a proof file.
  • Iter log includes /flow-next:impl-review ... --proof.
  • Receipts still pass existing schema.

RP e2e tests should not need semantic reviewer changes, only confirmation that the prompt path still works and receipt gates remain intact.

Non-goals

  • Do not remove independent review.
  • Do not reduce mandatory re-review for substantive fixes.
  • Do not weaken completion review.
  • Do not make receipt validation optional.
  • Do not lower validation expectations.

Expected Effect

This should not make hard tasks cheap. The expensive loop remains because it is the quality mechanism.

The expected win is fewer avoidable review rounds:

  • Fewer "test does not hit intended branch" findings.
  • Fewer weak assertion findings.
  • Fewer platform/version overclaim findings.
  • Fewer reviewer cycles spent reconstructing what the worker thought it proved.
  • Better future behavior as generalized review lessons accumulate in memory.
