Reduce Ralph review churn with task quality proof packets #111

@clairernovotny

Description


Context

Recent Ralph runs show that the expensive loop (implementation + full reviewer loop + broad validation + workflow bookkeeping) is also the reason output quality is high.

The goal should not be to remove the loop. The goal should be to reduce avoidable re-review cycles by moving recurring reviewer objections earlier into the worker phase, while preserving independent review, re-review, validation, and receipts.

The observed high-value review findings were mostly not cosmetic. They caught false confidence:

  • Tests that passed but did not hit the intended branch.
  • Assertions that used broad Contains checks instead of exact behavior.
  • Platform or version evidence that was claimed but not actually proven.
  • Numeric precision cases that looked valid but were not exactly representable.
  • Locale/calendar variants missed by the first implementation.
  • Coverage tasks where uncovered branches needed either real stimuli or explicit unreachable proof.

Findings From Recent Ralph Runs

This proposal comes from analyzing the last two Ralph runs on the Humanizer repo, not from a general desire to make the loop faster. The data shows that the long duration was mostly active quality work, and that the slowest cycles were exactly the ones where review found gaps between "tests exist" and "the intended behavior is actually proven."

Run-Level Data

| Run | Scope | Iterations | Timeout events | Known turns | Instrumented active minutes* | Outcome |
|---|---|---|---|---|---|---|
| 20260412-220502-e6d1 | Finished most of Urdu locale epic, then started coverage epic | 25 | 3 | 1,252 | 790.5 | Hit MAX_ITERATIONS while task .9 was still in progress |
| 20260413-182801-22c6 | Finished remaining coverage tasks and completion review, then started fn-10 plan review | 10 | 1 | 436 | 307.4 | Hit plan-review infra blocker because RP was unavailable |
| Total | Two Ralph runs | 35 | 4 | 1,688 | 1,097.9 min / 18.3 hr | Work mostly completed; remaining blocker was infrastructure/workflow |

* Instrumented active minutes are from per-iteration duration_ms plus 60 minutes for worker timeouts. Wall-clock elapsed was higher because run start/end spans include gaps and process overhead.

Concentration of Cost

The cost was concentrated in a small number of high-turn iterations:

| Iteration | Task / phase | Turns | Minutes | What happened |
|---|---|---|---|---|
| Run 1 iter 8 | fn-8.10 | 186 | 37.6 | Hijri calendar support needed review fixes for independent month emission, UmAlQuraCalendar, bidi validation, and exact per-month outputs. |
| Run 1 iter 11 | fn-8.12 | 151 | 58.5 | Large Urdu locale validation surface; high turn count indicates broad code/test/review/validation sweep. |
| Run 1 iter 16 | fn-8 completion review | 123 | 32.4 | Epic-level review verified the combined Urdu implementation rather than per-task slices. |
| Run 1 iter 19 | fn-9.3 | 113 | 21.2 | Coverage-gap task with review/validation work. |
| Run 1 iter 4 | fn-8.3 | 104 | 23.0 | Re-ran after timeout; review had found a real negative ordinal bug and missing high-scale/parse coverage. |
| Run 2 iter 5 | fn-9.14 | 119 | 55.9 | WordsToNumber/TimeSpan branch coverage needed multiple fixes because some tests did not reach the claimed branches. |
| Run 2 iter 2 | fn-9.10 | 75 | 52.6 | Core extension coverage needed exact output assertions and real error-branch tests. |
| Run 2 iter 9 | fn-9 completion review | 73 | 21.2 | Completion review found workflow/spec evidence issues after all tasks were marked done. |

Across both runs:

  • 14 iterations had >=50 turns, consuming 560.1 active minutes.
  • 6 iterations had >=100 turns, consuming 228.7 active minutes.
  • Median iteration duration was 31.5 minutes; p75 was 52.0 minutes.
  • 4 worker timeouts consumed about 240 minutes by themselves.

This is the problem to optimize: not that review exists, but that the worker often entered review without a compact proof of what it thought it had proven. The reviewer then had to discover weak assertions, wrong branch stimuli, missing variant proof, and evidence overclaims late in the loop.

What Later Cycles Actually Fixed

The extra cycles produced meaningful correctness improvements:

| Area | Later-cycle finding | Concrete correction | Why it matters |
|---|---|---|---|
| Urdu negative ordinals (fn-8.3) | Negative ordinal path appended suffixes before checking exact replacements. | Fixed negative ordinals to consult exact replacements first; added tests for -1, -5, high-scale cardinal output, and negative parse round-trip. | Real behavior bug. A green build and basic locale tests would not have caught it. |
| Hijri calendar (fn-8.10) | Tests covered Hijri conceptually but missed the runtime UmAlQuraCalendar variant. | Changed detection to include UmAlQuraCalendar; added exact-output month tests. | Users can hit this calendar implementation at runtime. Fragment tests were insufficient. |
| Urdu regional test matrix (fn-8.6) | Review found missing 101 feminine ordinal, string neuter fallback, bidi sweep gaps, and missing UseCulture. | Added targeted tests and culture binding. | This is locale correctness, not style. Missing culture binding can make tests pass for the wrong reason. |
| Coverage baseline (fn-9.1) | Local macOS artifacts were being treated too close to canonical baseline evidence and omitted net48. | Reworked artifact status/placement, added real baseline generation structure, compacted huge uncovered.json, documented local baseline as informational. | Prevented downstream tasks from using a misleading baseline. |
| Dual Roslyn analyzers (fn-9.7) | Tests could have validated the old analyzer path or stale assets while claiming Roslyn 4.14 coverage. | Added isolated intermediate outputs, explicit restore/build behavior, and package-version assertions. | This is a classic false-proof problem: tests pass but do not prove the intended compatibility arm. |
| Engine factory coverage (fn-9.9) | Initial tests did not cover enough factory dispatch/value-resolution/token-map branches. | Added scope assertions, reflection-based token-map tests, defensive exception tests, value-resolution tests, and broad factory dispatch coverage. | Branch-coverage task required a branch map, not just more tests. |
| ByteSize/Truncator (fn-9.12) | Test used long.MaxValue through double, which is not exactly representable. | Switched to 1L << 53, renamed the test to state the precision contract, strengthened exact assertions. | Avoided a mathematically misleading test. |
| WordsToNumber/TimeSpan (fn-9.14) | Several tests did not exercise the claimed branches, including Filipino hyphenated vs glued token forms and misunderstood TimeSpan overflow/max-unit paths. | Added branch-accurate stimuli, exact TimeSpan expected strings, exception message checks, out-param checks, and a synthetic fixture for an otherwise unreachable ShouldIgnore path. | This is the strongest evidence for the proposal: the reviewer repeatedly had to reverse-engineer whether tests actually hit intended branches. |
| Coverage gate (fn-9.11) | Malformed/missing XML tests needed stronger assertions. | Strengthened assertions and task evidence. | Reduces chance of a gate script silently accepting bad inputs. |
| Completion review (fn-9) | Epic-level review found task/state/spec evidence drift after per-task SHIP verdicts. | Synced task .11 status and documented task .6 null-guard scope amendment. | Confirms completion review is useful and should not be removed. |
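The fn-9.12 row deserves a concrete illustration. A short Python sketch (Python floats are IEEE 754 binary64, the same as C#'s double) shows why long.MaxValue silently loses precision through a double while 1L << 53 does not:

```python
# Demonstration of the fn-9.12 precision pitfall.

long_max = 2**63 - 1  # long.MaxValue

# Round-tripping long.MaxValue through a double silently changes it:
# the nearest representable double is 2**63 itself.
assert int(float(long_max)) != long_max
assert int(float(long_max)) == 2**63

# 2**53 is the boundary below which every integer is exactly
# representable as a double, so 1L << 53 round-trips losslessly:
exact = 1 << 53
assert int(float(exact)) == exact

# The very next integer is the first one that is lost.
assert int(float(exact + 1)) != exact + 1
```

A test built on `long_max` would therefore have asserted behavior for a value the code under test never actually received, which is exactly the "mathematically misleading" problem the review caught.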

Failure / Churn Modes Observed

These are the specific recurring problems the Quality Proof Packet targets:

  1. Branch stimulus ambiguity: tests existed, but it was unclear or false that they reached the intended branch.
  2. Weak observable assertions: Contains or broad checks passed without proving exact behavior.
  3. Variant proof gaps: platform/calendar/Roslyn-version variants were claimed without direct evidence.
  4. Evidence overclaiming: local artifacts were treated as canonical or CI-equivalent before they were.
  5. Unreachable branch ambiguity: reviewer had to ask whether a branch was truly unreachable or merely untested.
  6. Reviewer reconstruction cost: every review had to reconstruct the worker's intended coverage map from code, tests, and prose.
  7. Workflow/infrastructure churn: missing receipts, session mismatch, and RP unavailability caused retries that did not improve code quality.

The current Ralph quality model is directionally right. The loop is slow because it keeps forcing claims to become proof. The opportunity is to make the worker present stronger proof before the first review, then let the reviewer verify that proof independently.

Estimated Impact

These estimates assume no reduction in review strictness, no removal of completion review, and no weaker validation gates.

| Change | Expected time impact | Expected token impact | Why |
|---|---|---|---|
| Quality Proof Packet before first review | 10-20% overall runtime reduction on broad Ralph runs; 20-35% reduction on coverage/localization/analyzer-heavy tasks | 10-25% fewer review/fix-loop tokens on high-churn tasks | Avoids at least one reviewer-discovered weak-test/branch-proof cycle on tasks like fn-9.12, fn-9.14, fn-9.7, fn-8.10. |
| Branch/variant proof targets in task specs | 5-10% overall, higher on coverage epics | 5-15% fewer implementation/review tokens | Worker starts with explicit proof obligations instead of inferring them after review feedback. |
| Typed review categories | Small immediate runtime win; enables later policy | 5-10% on re-review prompts | Lets fix loops focus on blocking correctness/test/evidence categories and stop spending tokens on unstructured issue parsing. |
| Evidence/proof rendering in flowctl done | Small runtime win | Small token win | Completion review can inspect proof/evidence directly instead of reconstructing why a task was considered done. |
| Ralph events.jsonl analytics | No direct runtime win | No direct token win | Makes future bottleneck analysis automatic instead of forensic. |
| Workflow/session/receipt cleanup | 5-15% reduction in wasted retries when guard/session mismatches occur | 5-10% fewer retry tokens | Targets churn like missing receipt resets and manual receipt/state mismatch, without touching code-quality gates. |

Reasonable expectation for a Humanizer-sized run:

  • From the observed 18.3 instrumented active hours across the two runs, a conservative 10-20% improvement is 1.8-3.7 hours saved.
  • On high-churn iterations only, observed 560 active minutes across iterations with >=50 turns. A 20-35% reduction there is 1.9-3.3 hours saved.
  • Token savings should mostly come from fewer repeated review/fix turns, not shorter first-pass implementation. For high-churn tasks, expect 10-25% fewer review-loop tokens. Across the full run, expect 5-15% fewer total tokens, because implementation, tests, and broad validation still remain necessary.
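The hour figures above follow directly from the instrumented run data; a minimal Python sketch reproduces the arithmetic:

```python
# Reproduce the savings bounds from the instrumented run data above.
total_active_hours = 1097.9 / 60   # ~18.3 hr across both runs
high_churn_minutes = 560.1         # iterations with >= 50 turns

# Conservative 10-20% overall improvement:
overall_saved = [round(total_active_hours * p, 1) for p in (0.10, 0.20)]
print(overall_saved)               # [1.8, 3.7] hours

# 20-35% reduction on high-churn iterations only:
churn_saved = [round(high_churn_minutes * p / 60, 1) for p in (0.20, 0.35)]
print(churn_saved)                 # [1.9, 3.3] hours
```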

The estimate is intentionally bounded. This proposal should make hard tasks less repetitive, not cheap. The quality-producing loop stays intact.

Proposal

Add a first-class Quality Proof Packet for each task.

The worker creates this artifact after implementation and targeted validation, before /flow-next:impl-review. The reviewer treats it as untrusted implementation claims and verifies it against code, diff, tests, and spec.

Example shape:

```markdown
# Quality Proof: fn-9.14

## Scope
Implemented WordsToNumber and TimeSpan branch coverage for ...

## Spec Coverage
| Acceptance item | Evidence |
|---|---|
| Handles invalid ordinal abbreviation | Test X asserts parsedNumber=0 and unrecognized word |

## Branch / Variant Proof
| Target | Stimulus | Observable assertion | Why this hits target |
|---|---|---|---|
| Filipino teen prefix branch | "labingwalo" | returns 18 | glued form enters prefix branch; hyphenated form does not |

## Assertion Quality
- No behavior tests rely on broad `Contains` unless substring behavior is the contract.
- Error paths assert out params / exception type / exception message where relevant.
- Culture-sensitive outputs assert exact strings.

## Unreachable / Deferred
| Target | Reason |
|---|---|
| Greedy profile ShouldIgnore production path | No production profile currently configures ignored tokens; synthetic fixture added |

## Validation
- `dotnet test ... --filter ...` passed
- Coverage target improved from X to Y, or explicit unreachable proof supplied
```

Implementation Plan

1. Add proof guidance to planning

Update:

  • plugins/flow-next/skills/flow-next-plan/steps.md

Generated task specs may include:

```markdown
## Correctness proof targets
| Target | Required proof |
|---|---|
| Branch/variant | Stimulus + exact observable assertion |
| Platform/version | Evidence command or explicit boundary |
| Error path | Exception/out-param assertion |
| Unreachable | Short proof, not handwave |
```

This should be mandatory for coverage, localization, platform, analyzer, compatibility, and broad refactor tasks. It can be omitted for simple tasks.

2. Add a worker phase before review

Update:

  • plugins/flow-next/agents/worker.md

After implementation and targeted tests, create:

/tmp/flow-quality-proof-$TASK_ID.md

For branch/coverage tasks, require at least one proof row per target branch or a clear unreachable row.

For tests, require assertion-quality notes.

For variant/platform work, require one of:

  • proved locally
  • proved by CI
  • not provable on this host, with explicit follow-up/evidence boundary

Then change review invocation from:

/flow-next:impl-review <TASK_ID> --base $BASE_COMMIT

to:

/flow-next:impl-review <TASK_ID> --base $BASE_COMMIT --proof /tmp/flow-quality-proof-$TASK_ID.md

3. Extend impl-review argument parsing

Update:

  • plugins/flow-next/skills/flow-next-impl-review/SKILL.md
  • plugins/flow-next/skills/flow-next-impl-review/workflow.md

Parse --proof <path>.

For Codex backend, pass it through:

flowctl codex impl-review "$TASK_ID" --base "$DIFF_BASE" --proof "$PROOF_PATH" --receipt "$RECEIPT_PATH"

For RP backend, read the proof file and embed it in the prompt as:

## Implementer's Correctness Proof
...

The prompt must instruct:

Treat this proof as untrusted implementation claims. Verify each claim against code, tests, diff, and spec. Flag any claim that is missing, weak, or not actually exercised.

4. Extend flowctl codex impl-review

Update:

  • plugins/flow-next/scripts/flowctl.py

Add:

  • --proof argument to codex impl-review
  • helper like read_optional_review_artifact(path, max_bytes=100000)
  • quality_proof parameter to build_review_prompt
  • <quality_proof>...</quality_proof> section before <review_instructions>

Keep RP and Codex review semantics aligned.
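A minimal sketch of how this could look in flowctl.py. The names read_optional_review_artifact and build_review_prompt come from the plan above; the prompt structure and tag layout are illustrative assumptions, not the actual flowctl implementation:

```python
import os

def read_optional_review_artifact(path, max_bytes=100000):
    """Return artifact text truncated to max_bytes, or None if absent.

    The proof file is optional: a missing or unreadable path must not
    fail the review, only omit the <quality_proof> section.
    """
    if not path or not os.path.isfile(path):
        return None
    try:
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            text = f.read(max_bytes + 1)
    except OSError:
        return None
    if len(text) > max_bytes:
        text = text[:max_bytes] + "\n[truncated]"
    return text

def build_review_prompt(task_id, diff, instructions, quality_proof=None):
    """Assemble the review prompt; <quality_proof> precedes <review_instructions>."""
    parts = [f"<task>{task_id}</task>", f"<diff>{diff}</diff>"]
    if quality_proof:
        parts.append(
            "<quality_proof>\n"
            "Treat this proof as untrusted implementation claims. Verify each\n"
            "claim against code, tests, diff, and spec.\n\n"
            + quality_proof
            + "\n</quality_proof>"
        )
    parts.append(f"<review_instructions>{instructions}</review_instructions>")
    return "\n".join(parts)
```

The defensive behavior matters: an empty or oversized proof should degrade to a normal review rather than break the loop.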

5. Tighten review output categories

Update RP and Codex implementation review prompts to require each finding to include:

- Severity: Critical / Major / Minor / Nitpick
- Category: correctness / test-strength / evidence / scope / docs / style / infrastructure / unreachable-proof
- Blocks verdict: yes/no
- File:Line:
- Problem:
- Suggestion:

Do not initially build control flow around categories. First make them visible and consistent. Later Ralph can use them for analytics or different retry policy.
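Once the fields are consistent, later analytics can consume them mechanically. A hypothetical Python sketch, assuming findings follow the template above verbatim:

```python
import re

# Parse findings that follow the structured review template. The exact
# field layout is an assumption; only Severity/Category/Blocks are used.
FINDING_RE = re.compile(
    r"- Severity: (?P<severity>Critical|Major|Minor|Nitpick)\n"
    r"- Category: (?P<category>[a-z-]+)\n"
    r"- Blocks verdict: (?P<blocks>yes|no)"
)

def blocking_findings(review_text):
    """Return (severity, category) pairs for findings that block the verdict."""
    return [
        (m.group("severity"), m.group("category"))
        for m in FINDING_RE.finditer(review_text)
        if m.group("blocks") == "yes"
    ]
```

For example, a review containing one blocking Major test-strength finding and one non-blocking Minor style finding would yield only the former, which is exactly the filtering a focused fix loop needs.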

6. Put proof into task evidence

Update worker.md evidence JSON shape to allow:

```json
{
  "commits": ["..."],
  "tests": ["..."],
  "prs": [],
  "review": {"mode": "rp", "verdict": "SHIP"},
  "quality_proof": {
    "path": "/tmp/flow-quality-proof-fn-9.14.md",
    "branch_targets": 6,
    "unreachable_claims": 1
  }
}
```

Update cmd_done in plugins/flow-next/scripts/flowctl.py so ## Evidence renders optional Review: and Quality proof: lines while preserving existing commits/tests/prs behavior.
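A sketch of the additive rendering behavior. The function name and exact line formats are assumptions; the point is that Review: and Quality proof: lines appear only when present, and unknown keys stay harmless:

```python
def render_evidence(evidence):
    """Render an ## Evidence block; review/quality_proof lines are optional."""
    lines = ["## Evidence"]
    if evidence.get("commits"):
        lines.append("Commits: " + ", ".join(evidence["commits"]))
    if evidence.get("tests"):
        lines.append("Tests: " + ", ".join(evidence["tests"]))
    if evidence.get("prs"):
        lines.append("PRs: " + ", ".join(str(p) for p in evidence["prs"]))
    review = evidence.get("review")
    if review:
        lines.append(f"Review: {review['mode']} -> {review['verdict']}")
    proof = evidence.get("quality_proof")
    if proof:
        lines.append(
            f"Quality proof: {proof['path']} "
            f"({proof.get('branch_targets', 0)} branch targets, "
            f"{proof.get('unreachable_claims', 0)} unreachable claims)"
        )
    return "\n".join(lines)
```

Because every field is rendered conditionally, existing evidence JSON without the new keys produces exactly the output it does today, preserving backward compatibility.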

7. Preserve full gates, but move broad validation to the right boundary

Update worker guidance:

  • Before first review: run targeted tests and quick commands relevant to the task.
  • After reviewer fixes: run targeted tests for changed surfaces.
  • Before flowctl done: run task-level required validation.
  • At epic completion: run broad validation and completion review.

Do not remove broad validation. The improvement is avoiding full-suite repetition after every small review edit unless the changed surface justifies it.

8. Improve memory capture from review findings

Update:

  • plugins/flow-next/scripts/hooks/ralph-guard.py
  • plugins/flow-next/agents/worker.md

When review returns NEEDS_WORK or MAJOR_RETHINK, encourage generalized lessons only, categorized to match the review taxonomy.

Examples:

  • pitfall: Branch coverage tests must state the exact stimulus that reaches the branch.
  • pitfall: Avoid Assert.Contains for localization output unless substring behavior is the contract.
  • convention: Analyzer compatibility tests must assert loaded package version.

Then require relevant memory entries to appear in the next Quality Proof Packet under a "Relevant memory applied" section.

9. Add Ralph analytics without changing behavior

Update:

  • plugins/flow-next/skills/flow-next-ralph-init/templates/ralph.sh

Append a lightweight events.jsonl per iteration:

{"iter":7,"phase":"work","id":"fn-9.14","started_at":"...","ended_at":"...","rc":0,"timeout":false,"verdict":"SHIP","receipt_ok":true,"task_status":"done"}

This makes future runtime analysis mechanical instead of forensic.

Tests

Update plugins/flow-next/scripts/smoke_test.sh:

  • build_review_prompt includes <quality_proof>.
  • Proof text is marked untrusted.
  • codex impl-review --help includes --proof.
  • cmd_done renders optional review and quality_proof evidence fields.

Update plugins/flow-next/scripts/ci_test.sh:

  • Evidence JSON with extra keys remains accepted.
  • Rendered evidence includes commits/tests/prs plus optional review/proof.
  • Existing evidence JSON remains backward-compatible.

Update plugins/flow-next/scripts/ralph_smoke_test.sh:

  • Stub worker writes a proof file.
  • Iter log includes /flow-next:impl-review ... --proof.
  • Receipts still pass existing schema.

RP e2e tests should not need semantic reviewer changes, only confirmation that the prompt path still works and receipt gates remain intact.

Non-goals

  • Do not remove independent review.
  • Do not reduce mandatory re-review for substantive fixes.
  • Do not weaken completion review.
  • Do not make receipt validation optional.
  • Do not lower validation expectations.

Expected Effect

This should not make hard tasks cheap. The expensive loop remains because it is the quality mechanism.

The expected win is fewer avoidable review rounds:

  • Fewer "test does not hit intended branch" findings.
  • Fewer weak assertion findings.
  • Fewer platform/version overclaim findings.
  • Fewer reviewer cycles spent reconstructing what the worker thought it proved.
  • Better future behavior as generalized review lessons accumulate in memory.
