+For the built-in SOP-Bench and Lifelong AgentBench runners, `patch` is not only a label in the posterior audit context. Observed benchmark failure modes are mapped to concrete patch rules, but they are injected into the next prompt under `Bayesian Failure-Mode Patches` only after the same failure mode has at least two verified occurrences. A single failure remains candidate evidence in `belief_*.json` and `posterior_context_*.md`, which reduces overfitting to one-off mistakes. The prompt does not include raw posterior numbers such as `posterior_success`, `alpha`, or `beta`. For example, repeated `left_expected_output_blank` failures add a CSV writeback verification rule, and repeated `invented_unrequested_column` failures add SQL column-use constraints. v0.x records post-patch evidence back to the same benchmark Skill; later releases may split recurring patches into separate child Skill hypotheses.
0 commit comments