fix: prevent self-audit 'Violates Known Physics' example bleedthrough#585
Merged
Conversation
Previous fix (00d997e) added out-of-scope exclusions for legislation, treaties, currency adoption, etc. — but left the concrete physics examples inside the same instruction: "perpetual motion, faster-than-light travel, reactionless/anti-gravity propulsion, time travel" plus "thermodynamics, conservation of energy, relativity".

On a Denmark euro-adoption run, item 1 was rated HIGH with justification: "success literally requires breaking the named law of physics (conservation of energy) for a reactionless/anti-gravity propulsion system." Both "reactionless/anti-gravity propulsion" and "conservation of energy" are lifted verbatim from the instruction's own example lists. Same bleedthrough pattern as PR #582 on identify_documents: concrete few-shot examples get reproduced as findings by weaker models.

Fix: strip the sci-fi system examples and the named-law examples from the instruction. Add an explicit anti-fabrication rule: the model must quote text from the plan describing a physics-violating mechanism, or else rate LOW. For ≥MEDIUM ratings, require the justification to quote the plan text alongside naming the violated law.

Scope: item 1 only. Does not touch Bug B (template lock across audit items 4-20 sharing identical justifications) — separate concern, will address in a follow-up PR if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit on this branch kept two enumerations inherited from commit 00d997e that named specific domains as out-of-scope — including "currency adoption" and "economics / finance / regulation / policy". The Denmark euro-adoption plan that triggered this bug matches those literal strings, so a passing test would only prove the model can pattern-match the enumeration back to the prompt, not that the structural anti-hallucination rule is working.

Strip both domain lists. The instruction now relies only on:
- the disambiguation that 'Laws' means physics laws, not legal ones,
- the structural rule that the model must quote plan text describing a physics-violating mechanism or rate LOW.

The re-run now becomes an honest test of the structural fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the Denmark euro-adoption rerun, item 1 still rated HIGH with
justification: "success literally requires breaking the EU's monetary
architecture (Article 140 TFEU and the no-bailout clause)."
User's insight confirmed: the instruction's negation-bait ("NOT legal
statutes, treaty law, regulations") primes the model to consider
treaty law as in-scope. Seeing "treaty law" mentioned in the rubric
makes "Article 140 TFEU" feel like a valid physics-law citation. The
more we said "NOT X", the more the model latched onto X.
Rewrite with positive-only framing:
- Define a law of physics by discipline (mechanics, thermodynamics,
electromagnetism, quantum mechanics, relativity).
- Define a physics violation by substrate (physical mechanism: device,
energy flow, field, force, material process).
- Require the ≥MEDIUM justification to quote plan text describing the
physical mechanism AND name the violated law.
No mention of legal, treaty, regulation, policy, currency, or any
non-physics domain anywhere in the instruction. The model has no
hint-bait to latch onto; it either finds a physical-mechanism quote
in the plan or rates LOW.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the rerun, item 1 still rated HIGH with justification: "physical mechanism — moving the DKK/EUR peg within ERM II — that contradicts a law of physics (thermodynamics/mechanics) by demanding sustained intervention and precise energy flows to maintain the peg." The LLM exploited polysemy: "mechanism", "energy flow", and "force" all have metaphorical uses in economics and policy. Defining a physics violation as requiring a "physical mechanism" was insufficient — "intervention mechanism" and "currency flow" were close enough for the model to claim the gate was satisfied. Replace the "physical mechanism" gate with a harder one: to rate ≥MEDIUM, the justification must cite a specific physical quantity in SI units (joules, newtons, kelvin, coulombs, m/s, etc.) with a numerical magnitude, sourced from a verbatim plan-text quote and attached to a named physics law. Three gates (quote + law + SI magnitude), all-or-LOW. Currency pegs, treaty mechanisms, and organizational processes cannot be given coherent joule/newton/kelvin values against a named law — the model either has to fabricate a number (a visible tell) or rate LOW. Perpetual-motion and FTL proposals still rate HIGH cleanly because their SI-unit violations are real (net-positive joule output without input; velocity ≥ c). No negation of domains, no listing of metaphorical word uses — the quantitative gate does the work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three previous prompt iterations on this PR tried increasingly
structured rubrics (scope enumeration, positive-only framing, SI-unit
gate with three mandatory justification elements). All three failed
the same way: a weak non-reasoning model that has been trained to
flag audit items as HIGH will fabricate physics language ("ERM II
energy flows", "Article 140 TFEU physics violation") to match
whatever rubric text it finds, ignoring conditional structure.
User's observation: the prompt may be too confusing for the model to
follow. Strip it down to a short, unconditional instruction:
"This check applies only to plans describing physical devices or
material processes. Default rating: LOW. Rate HIGH only if the plan
requires a physical device or material process that cannot exist
under known physics."
No SI units, no example violations, no named laws, no domain
enumeration (positive or negative), no multi-gate conditionals.
Whether this works depends on whether the model respects the short
default-LOW instruction or continues to pattern-match HIGH regardless.
If it fails, the next step is a code-side post-validator that
overrides non-LOW ratings whose justification does not cite a concrete
physical quantity — but that is out of scope for this prompt-only PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s-example-bleedthrough
…OW=None mitigation rule

Two related false positives observed against the main-branch prompt:

1. The classifier rated HIGH on a plan involving a Cf-252 isotopic source because "handling radioactive material requires NRC authorization." That's a regulatory/permit gap (item 12), not a physics violation — Cf-252 itself exists under known physics. Add an explicit carve-out: regulatory, permitting, licensing, safety-handling, and authorization gaps are NOT physics violations. Naturally-occurring materials don't violate physics merely because handling them requires authorization. Also require naming the specific physical law (thermodynamics, conservation, speed-of-light limit, …) when rating HIGH.

2. A LOW physics rating produced "Mitigation: N/A" because the per-checklist instruction said "If LOW: Mitigation=None.", which contradicts the global STRICT RULE that mitigation must never be N/A and on LOW should reinforce good practice. Drop the local override and let the global rule drive. Also rename "Date" to "Timeframe" in the ≥MEDIUM clause to align with the relative-timeframe rule from #656; the global STRICT RULE already reinterprets "Date" elsewhere, but using the right word locally avoids the contradiction.
…rompts The check's purpose is to flag prompts that read like a story set in a magical world — Harry Potter, time-travel plans, FTL adventures, summoning rituals — not to gate ambitious but real-world engineering. Lead the instruction with that intent and list the kinds of mechanics that qualify (magic, time travel, FTL, perpetual motion, reactionless propulsion, summoning, teleportation by spell, resurrection) so the model recognises the category instead of inventing physics violations from regulatory or budget concerns.
Document the intent (fantasy/fictional-universe prompts) and the three concrete false-positive modes observed in production: regulatory-laws confusion, permit/handling-as-physics, and the LOW=None mitigation carve-out that produced N/A outputs. The comment is internal-only documentation; the system prompt is unchanged.
Another false positive: a linguistics 'Clear English' standard
was rated HIGH because the plan listed 'Detailed
Grapheme-to-Phoneme mapping strategy' as Missing Information.
The model read the old subtitle ('fundamental science') as
'any unspecified fundamental of the plan' and treated a
missing implementation detail as a physics violation.
Two fixes:
- Retarget the subtitle to ask the physics question directly
('breaking a known law of physics ...') so the model is no
longer cued by the broad word 'fundamental'.
- Add a HARD GATE to the instruction: if the model cannot name
a specific law of physics, the rating MUST be LOW. Missing
details, undefined parameters, unspecified strategies, vague
deliverables, and 'Missing Information' items are explicitly
out of scope and route to other checklist items.
Comment field updated to record the new failure mode.
…g verdicts Observed pathology: the model rated 'Violates Known Physics' as HIGH while the justification it produced said 'Rated HIGH because the plan relies on breaking no laws of physics ... no specific law violation is identifiable' — a self-contradiction inside one sentence. Root cause: the Pydantic schema for ChecklistAnswer declared 'level' as the first field, so the structured-output model committed to a verdict before generating the reasoning that should have justified it. Once HIGH was written, the justification field was reverse-engineered to fit, and any contradiction was emitted as-is. Reorder the schema so the model writes justification, then mitigation, then level. The verdict is now the last token the model produces and must agree with what was just written. Update the system prompt's 'keys in this order' line and strengthen the level field description to forbid level/justification mismatches.
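For illustration, a minimal sketch of the reordered schema, assuming Pydantic-style field declarations; the real ChecklistAnswer model's types and descriptions may differ:

```python
from pydantic import BaseModel, Field

class ChecklistAnswer(BaseModel):
    # Reasoning comes first so the model must write its argument
    # before it is allowed to commit to a verdict.
    justification: str = Field(
        description="Reasoning for the rating; must support the level chosen below."
    )
    mitigation: str = Field(
        description="Assignable mitigation for the finding."
    )
    # The verdict is the last field the model generates and must
    # agree with the justification already written.
    level: str = Field(
        description="'low', 'medium' or 'high'; must not contradict the justification."
    )
```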
Even with the schema reorder producing a correct LOW level on
the linguistics 'Clear English' plan, the mitigation was a
fabricated governance task ('Legal Team: Draft policy
clarifying authority hierarchy between the Linguistic Review
Panel and Project Director') borrowed from the plan body. The
global LOW-mitigation rule ('reinforce good practice') was too
vague to keep the model on-topic, so it grabbed any
plan-relevant task and rebadged it.
Add a per-checklist LOW-mitigation TEMPLATE to the physics
instruction: when LOW, the mitigation must stay on the
physics-violation topic, e.g. 'Project Manager: During scope
reviews, confirm no fantasy-physics dependency (magic, time
travel, FTL, perpetual motion) has been introduced — no
further action required.' Explicitly forbid borrowing
unrelated tasks from elsewhere in the plan; those belong to
other checklist items.
…olations

Latest false positive: a currency-hedging plan rated HIGH for 'Violates Known Physics' because the model latched onto the words 'physical locations' and a USD/GBP forward-contract risk. The fantasy/fictional-universe framing was inviting the model to find tangential narrative connections instead of asking the mechanical question.

Reframe the instruction as principle-only and physics-focused:
- Drop all 'fantasy', 'fictional-universe', 'magic' framing.
- Require the justification to name the specific violated law AND describe the physical-quantity violation (what is being created, destroyed, or transmitted faster than allowed).
- Provide a positive list of named laws that qualify when actually violated (thermodynamics 2nd law, conservation of energy/momentum, speed-of-light limit, causality, Pauli exclusion, mass-energy conservation).
- Hard-gate HIGH on naming the specific law.
- Explicitly list out-of-scope concerns: regulatory, missing details, ambitious timelines, budget, currency/financial risk, governance/staffing, linguistic/social/policy design, real-world materials, vague deliverables.
- Explicitly state that surface-level cues (the words 'physical', 'fundamental', 'science', 'law', 'physical location') are NOT sufficient — only a mechanism that contradicts a named law qualifies.

Comment field updated with the new failure mode.
Prompt iteration on the shared self-audit batch path stopped producing reliable verdicts on item 1. Even after the schema reorder, hard gate, and out-of-scope list, smaller models still rated medium/high on real-world plans (linguistics standard, currency hedging, change-control gaps) without ever naming a physics law in the justification. Move the check into its own module: worker_plan_internal/self_audit/violates_known_physics.py The module owns its system prompt (no shared rubric pulling at it), its response schema (justification → mitigation → level), and a deterministic safety net: when the LLM rates >= medium but the justification names no physics law (matched against a keyword list — thermodynamics, conservation of energy/momentum, speed of light, FTL, causality, perpetual motion, reactionless, Pauli exclusion, etc.), the rating is forced to "low" and the mitigation is replaced with a template. The fallback flag is recorded in metadata for telemetry. self_audit.py imports the module and, inside its main loop, detects checklist_item_index == VIOLATES_KNOWN_PHYSICS_INDEX and calls ViolatesKnownPhysics.execute via the existing llm_executor wrapper instead of the generic ChecklistAnswer batch call. The result is converted into ChecklistAnswer + ChecklistAnswerCleaned and spliced into responses[1] and the checklist_answers_cleaned list, so downstream items still see the physics result as previous-response context and the report output keeps its existing shape. Future work (the dataclass already has fields for it): expose llm_justification / llm_mitigation / llm_level / fallback_applied in audit telemetry so we can monitor how often the safety net fires.
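Roughly what the deterministic safety net looked like (it is removed again in the next commit because the keywords are English-only). PHYSICS_LAW_KEYWORDS and _justification_names_physics_law are the names from this commit; the exact keyword list and the LOW_MITIGATION_TEMPLATE wording here are illustrative, not the real module code:

```python
# Illustrative keyword list; the real list differed and was later removed
# because it only matched English-language justifications.
PHYSICS_LAW_KEYWORDS = (
    "thermodynamics", "conservation of energy", "conservation of momentum",
    "speed of light", "faster-than-light", "ftl", "causality",
    "perpetual motion", "reactionless", "pauli exclusion",
)

# Placeholder wording; the real template lives next to the system prompt.
LOW_MITIGATION_TEMPLATE = "No physics-related action required."

def _justification_names_physics_law(justification: str) -> bool:
    text = justification.lower()
    return any(keyword in text for keyword in PHYSICS_LAW_KEYWORDS)

def apply_safety_net(level: str, justification: str, mitigation: str) -> tuple[str, str, bool]:
    """Force a >= medium rating down to 'low' when no physics law is named."""
    if level in ("medium", "high") and not _justification_names_physics_law(justification):
        # The True flag corresponds to fallback_applied, recorded in metadata for telemetry.
        return "low", LOW_MITIGATION_TEMPLATE, True
    return level, mitigation, False
```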
Plans arrive in many languages — 'tidsrejse' for time travel, etc. — so any English keyword set is fragile by design. Remove the PHYSICS_LAW_KEYWORDS list, the _justification_names_physics_law guard, and the auto-downgrade path that depended on them. The check now relies on the focused system prompt and the schema's justification-before-level field order. The dataclass loses its llm_* and fallback_applied fields since they only existed to record fallback telemetry. A future safety net, if needed, should be language-agnostic (e.g. a second LLM verifier) rather than keyword-based; the module docstring notes this explicitly.
…audit constants The physics-violation module no longer exposes CHECKLIST_INDEX, CHECKLIST_TITLE, or CHECKLIST_SUBTITLE — those are an audit integration concern, not the check's own. The module is purely about answering 'does the plan break a named law of physics?' and returns a ViolatesKnownPhysics dataclass with justification, mitigation, level, plan_prompt, system_prompt, and metadata. Callers decide where to put the result. In self_audit.py, ViolatesKnownPhysics.execute is now called once before the main batch loop. Its result is spliced into the correct checklist slot (located by scanning checklist_items for index == VIOLATES_KNOWN_PHYSICS_INDEX, defined locally here). The main loop runs unchanged but with a 'continue' guard for the physics index. user_prompt_list and metadata_list are pre-allocated to len(checklist_items) and filled by index so positional alignment with system_prompt_list and checklist_items is preserved regardless of where the physics item sits. The loop's previous-response inclusion check was loosened from 'index > 0' to 'if responses' so it doesn't depend on the physics item being at list-index 0.
… physics entry The constant no longer holds 'all' items — the physics check runs through its dedicated module — so the name was misleading. Rename to BATCH_CHECKLIST_ITEMS to highlight the distinguishing property: these are the items processed by the shared batch loop (one LLM call each, same ChecklistAnswer schema, same format_system_prompt output). Drop the index=1 'Violates Known Physics' entry from the list entirely; its title and subtitle move to module-level constants (VIOLATES_KNOWN_PHYSICS_TITLE / SUBTITLE) next to VIOLATES_KNOWN_PHYSICS_INDEX, since they describe how the audit labels the dedicated module's result. execute() simplifies as a result: no more list pre-allocation, no more list-index splice, no more in-loop guard. Run physics unconditionally (gated by max_number_of_items >= 1), append its result to responses / checklist_answers_cleaned / the prompt & metadata lists, then loop BATCH_CHECKLIST_ITEMS with a normal append. max_number_of_items semantics: physics counts as one item, so max=1 means physics-only and max=3 means physics + the first two batch items.
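A rough sketch of the simplified ordering and the max_number_of_items semantics described above; physics_to_checklist_answer and run_batch_item are hypothetical helpers standing in for the real conversion and batch-call code:

```python
def run_self_audit(llm_executor, user_prompt: str, max_number_of_items: int) -> list:
    """Sketch only: physics runs first, then the shared batch loop appends normally."""
    responses: list = []
    # Physics counts as one item: max=1 means physics-only,
    # max=3 means physics plus the first two batch items.
    if max_number_of_items >= 1:
        physics = ViolatesKnownPhysics.execute(llm_executor, user_prompt)
        responses.append(physics_to_checklist_answer(physics))  # hypothetical conversion helper
    for item in BATCH_CHECKLIST_ITEMS[: max(0, max_number_of_items - 1)]:
        # Each batch item is one LLM call with the shared ChecklistAnswer schema;
        # earlier responses travel along as previous-response context.
        responses.append(run_batch_item(llm_executor, item, user_prompt, responses))  # hypothetical
    return responses
```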
…l site ViolatesKnownPhysics.execute now takes an LLMExecutor instead of a raw LLM. The executor.run wrapper, error wrapping (PipelineStopRequested re-raise, LLMChatError wrapping for other failures), and timing instrumentation all live inside the module. Closure variables capture the LLM-side outputs from the inner chat callback, so we don't have to thread them back out through a temporary dict. Call site in self_audit.py becomes one line: physics_result = ViolatesKnownPhysics.execute(llm_executor, user_prompt) This is a deliberate divergence from the assume/* convention (identify_purpose, classify_domain pass a raw LLM and let the caller wrap with llm_executor): the physics check is an audit-pipeline-level building block that knows it's run inside a retry / model-routing context, so taking the executor directly removes the boilerplate at every call site.
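A hedged sketch of the executor-owned error handling described above. LLMExecutor, PipelineStopRequested, LLMChatError, and the dataclass fields come from these commits; structured_physics_chat and the exact executor.run signature are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ViolatesKnownPhysics:
    justification: str
    mitigation: str
    level: str
    plan_prompt: str
    system_prompt: str
    metadata: dict = field(default_factory=dict)

    @classmethod
    def execute(cls, llm_executor, plan_prompt: str) -> "ViolatesKnownPhysics":
        check = None  # closure variable filled by the inner chat callback

        def run_chat(llm) -> None:
            nonlocal check
            # Stand-in for the real structured-output chat call against SYSTEM_PROMPT.
            check = structured_physics_chat(llm, SYSTEM_PROMPT, plan_prompt)

        try:
            llm_executor.run(run_chat)
        except PipelineStopRequested:
            raise  # stop requests propagate untouched
        except Exception as exc:
            raise LLMChatError("violates_known_physics chat failed") from exc

        assert check is not None
        return cls(
            justification=check.justification,
            mitigation=check.mitigation,
            level=check.level,
            plan_prompt=plan_prompt,
            system_prompt=SYSTEM_PROMPT,
        )
```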
Underscore-prefixed pseudo-private names are awkward when the class is the module's primary structured-output schema and shows up in type hints. Rename to PhysicsCheck — public, descriptive, no underscore.
A plan whose stated purpose is to investigate, develop, or scale up an effect that is consistent with known physics — better battery, new alloy, more efficient solar cell, quantum computing demonstration — is the very definition of R&D. The 'unproven at the required scale' wording previously sitting on the medium tier was inviting the model to flag those legitimate research projects as physics issues.

Two changes:
1. Add an explicit 'R&D is NOT a physics violation' paragraph making clear that legitimate research into unproven-at-scale effects stays low, and routing proven-at-scale concerns to the 'No Real-World Proof' audit item.
2. Tighten medium: drop 'consistent with known physics but unproven at the required scale' wording (that's R&D = low). Medium is now reserved for genuine borderline cases where the plan presupposes a physical phenomenon that, if real, would itself redefine known physics — should be very rare.

Add R&D to the explicit out-of-scope list as well.
… prompt The module is meant to be standalone — it knows nothing about its place in any larger pipeline. The phrase 'belong to other audit items (e.g., "No Real-World Proof")' inside the system prompt was leaking knowledge of the consumer's checklist. Replace with a self-contained statement: those concerns are not physics violations and stay 'low' here. Whoever consumes the result decides whether and where else to surface them.
`python -m worker_plan_internal.self_audit.violates_known_physics` pulls 10 prompts from the simple-plan catalog (sampled by SAMPLE_SEED so runs are reproducible and bumpable for fresh draws), runs ViolatesKnownPhysics.execute against each through a single-LLM executor, and prints the level, justification, and mitigation per prompt followed by a level-count summary. Mirrors the smoke pattern in classify_domain.py so iterations on the system prompt can be compared head-to-head.
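A sketch of what that smoke entry point looks like; load_simple_plan_catalog and make_single_llm_executor are placeholders for whatever the real module uses to load the catalog and wire the executor:

```python
import random
from collections import Counter

SAMPLE_SEED = 700   # bump for a fresh but reproducible draw
SAMPLE_SIZE = 10

def sample_catalog(catalog: list[dict]) -> list[dict]:
    prompts = list(catalog)
    random.Random(SAMPLE_SEED).shuffle(prompts)
    return prompts[:SAMPLE_SIZE]

def main() -> None:
    llm_executor = make_single_llm_executor()                    # placeholder wiring
    counts: Counter[str] = Counter()
    for entry in sample_catalog(load_simple_plan_catalog()):     # placeholder loader
        result = ViolatesKnownPhysics.execute(llm_executor, entry["prompt"])
        counts[result.level] += 1
        print(f"{entry['id']} -> {result.level}")
        print(f"  justification: {result.justification}")
        print(f"  mitigation:    {result.mitigation}")
    print("Level counts:", dict(counts))

if __name__ == "__main__":
    main()
```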
The R&D paragraph used 'building a better battery, developing
a new material or alloy, demonstrating a quantum-computing
capability, improving solar-cell efficiency' as worked
examples. 'Better battery' directly mirrors catalog prompt
daa0c969-... ('Invent a next-generation rechargeable battery
... The goal is to invent a better battery.') — that is
training-on-the-test-set: the model appears to handle the
catalog prompt because we put the test prompt's wording in the
system prompt.
Replace with a principle-only category description: 'a
phenomenon whose underlying mechanism is consistent with known
physics — i.e. the mechanism has been observed somewhere in
nature or in the laboratory, even if humans have not engineered
it at the required scale, duration, or in the required
materials'. This generalises (covers fusion via 'observed in
nature', space elevator via 'consistent with known physics',
plus everything we previously enumerated) without paraphrasing
any specific catalog prompt. Smoke harness on SAMPLE_SEED=700
still produces 10/10 LOW with on-topic justifications.
Re-running on a fresh sample after the test-prompt-paraphrase fix. SEED=800 draws 10 different catalog prompts including the 'Clear English' linguistics case that triggered HIGH on the old shared prompt, plus a 'rework school to teach world-is-flat' prompt that's a plausible false-positive trap (teaches false physics, but does not require breaking physics). Both correctly LOW. 10/10 LOW with on-topic justifications, no regressions.
…oods as truth

Single 'impossible-engineering' trigger missed an obviously problematic class of plans: those whose stated purpose is to teach, market, or build infrastructure premised on a claim that contradicts known physics — flat-earth curricula, anti-gravity-product marketing, perpetual-motion training programs. Those don't require anyone to literally break physics, but the plan's output IS the physics violation.

Add a second HIGH trigger:
- (A) IMPOSSIBLE-ENGINEERING — the existing rule, unchanged.
- (B) PROPAGATING-FALSEHOOD — the plan asserts as true to students/customers/infrastructure a claim that contradicts a named law of physics or a directly-observable physical fact (Earth's shape, conservation laws, speed-of-light limit, basic mechanics, radiometric ages). Surveys, critiques, and documentaries about fringe claims stay 'low'; only asserting-as-truth qualifies.

Smoke harness on SAMPLE_SEED=800 with the broadened rule:
- Flat-earth education plan (catalog id 2891ff5f...) now flags HIGH with a clean justification citing Earth's observed shape and Newtonian gravity.
- Other 9 prompts stay LOW (casino, biological verification, Clear English linguistics, malaria, AI-agent social media, child-labor policy, capsule hotel, lethal-social-program, vegan butcher) — no regressions on the previously-fragile linguistics case.
Validates the broadened HIGH rule (impossible-engineering OR propagating-falsehood) on a fresh sample of 10 catalog prompts. SEED=900 includes ambitious-tech prompts that the broadened rule could have over-flagged: a 15-year cryopreservation / cryosleep program, a China-Russia lunar research station with a surface fission reactor, deep cave exploration in extreme radiation, a Minecraft escape room, an emergency-preparedness business. All correctly LOW with on-topic justifications. Two seeds in a row (800 + 900) confirm the broadening flagged the targeted false-negative (flat-earth education on SEED=800) without producing new false positives on the surrounding prompts.
…=1000)
Add HELD_OUT_IDS set listing the 27 catalog IDs already
exercised by SEED=700/800/900 smoke runs, filter them out
before shuffling, bump SAMPLE_SIZE to 20, and run with a fresh
SEED=1000. Mirrors the held-out evaluation pattern documented
in classify_domain.py — the system prompt has not been tuned
against any of these 20 prompts, so the result is a fair test.
Result: 20/20 LOW with on-topic justifications. The broadened
rule did not produce false positives on a sample that
included several plausibly-trickier cases:
- 'Upload Intelligence' brain connectome mapping (correctly
routed to research/inquiry, not 'asserts mind-upload as
truth')
- mirror-image biomolecules / synthetic chirality
- transoceanic submerged tunnel
- 12th-century medieval castle reconstruction
- off-shore organ-gestation cloning facility
- Berlin BRZ wastewater-to-protein conversion
- Three Laws of Robotics rewrite
Combined with SEED=800 (1 HIGH, flat-earth, correctly flagged)
and SEED=900 (10 LOW), this is 39/40 LOW + 1/40 HIGH across
the four-seed evaluation, matching expected behaviour.
…ts (SEED=1100)

Add the 20 catalog IDs from the SEED=1000 evaluation to HELD_OUT_IDS and run a fresh 20-prompt sample with SEED=1100. Result: 20/20 LOW with on-topic justifications.

Notable cases the broadened rule could have over-flagged (propagating-falsehood trigger, ambitious-tech themes) but correctly classified as LOW:
- 'Project Solace' L1 solar-shade ($5T G20 initiative)
- .mars top-level domain via ICANN
- Denmark-England train bridge
- Statue of Liberty Paris relocation
- Stonehenge replica using 2500 BC tools
- Space-Based Universal Manufacturing precursor (EUR 200B)
- CRISPR canine genome edit
- Police robots in Brussels (Chinese-style)
- 'AI Unrest Prep' multi-agency strategy
- Microplastic policy program (Kiel)

Combined evaluation across five seeds:
SEED=700  (10 prompts) : 10 LOW
SEED=800  (10 prompts) : 9 LOW + 1 HIGH (flat-earth target)
SEED=900  (10 prompts) : 10 LOW
SEED=1000 (20 prompts) : 20 LOW
SEED=1100 (20 prompts) : 20 LOW

Total: 69/70 LOW + 1/70 HIGH. The single HIGH is the flat-earth education curriculum, which the broadened rule was explicitly designed to flag. No collateral false positives across 70 distinct catalog prompts.
…D=1200)
Add the 20 SEED=1100 IDs to HELD_OUT_IDS and run a fresh
20-prompt sample with SEED=1200. Result: 20/20 LOW.
Significant validation point: the catalog 'better battery'
prompt (daa0c969...) appeared in this sample. It was the
prompt that motivated removing 'building a better battery' as
a test-prompt-paraphrasing example from the system prompt
earlier. The rule generalises correctly — the model classified
it LOW on principle ('realistic physical limits; does not
require energy creation from nothing'), confirming the
abstract category description ('phenomenon whose underlying
mechanism is consistent with known physics ... not yet
engineered at the required scale') carries the load without
naming any test-set project type.
Other notable cases the broadened rule handled correctly:
- Manned moon mission with permanent base
- 700-emitter coherent beam combining stress-test
- 180m ice-class luxury yacht
- 85-km Denmark fixed-link hybrid bridge-tunnel
- Reverse Aging Research Lab ($500M, 10-yr biomedical R&D)
- Forcibly elevate chimpanzee intelligence (genetic eng.)
- Turn off all electricity worldwide
Combined evaluation now spans six seeds and 90 distinct
catalog prompts:
SEED=700 (10) : 10 LOW
SEED=800 (10) : 9 LOW + 1 HIGH (flat-earth target)
SEED=900 (10) : 10 LOW
SEED=1000 (20) : 20 LOW
SEED=1100 (20) : 20 LOW
SEED=1200 (20) : 20 LOW
Total: 89/90 LOW + 1/90 HIGH. The single HIGH remains the
flat-earth education plan that the broadened rule was
designed to flag.
…ED=1300)

Add the 20 SEED=1200 IDs to HELD_OUT_IDS and run a fresh 20-prompt sample with SEED=1300. Result: 20/20 LOW.

Catalog has 124 prompts total; HELD_OUT_IDS now covers 107 unique IDs. The remaining ~17 fresh prompts are insufficient for another full 20-prompt held-out run — future evaluation should refresh the catalog or accept partial sample sizes.

Notable cases this round (sci-fi-flavoured but physics-allowed):
- $20B space debris removal (laser mitigation, robotic capture)
- Westworld-style immersive humanoid entertainment
- Face/Off-style face transplant facility
- Modern-day pyramid using 2500 BC methods
- VIP rogue-AI bunker
- Black-op surveillance/capture program ($500M, Venezuela)
- 'Shoot a superintelligence' personal weapons plan

Combined evaluation now spans seven seeds and 110 distinct catalog prompts:
SEED=700  (10) : 10 LOW
SEED=800  (10) : 9 LOW + 1 HIGH (flat-earth target)
SEED=900  (10) : 10 LOW
SEED=1000 (20) : 20 LOW
SEED=1100 (20) : 20 LOW
SEED=1200 (20) : 20 LOW
SEED=1300 (20) : 20 LOW

Total: 109/110 LOW + 1/110 HIGH. The single HIGH remains the flat-earth education plan that the broadened rule was designed to flag.
Adds two prompts useful for stressing the broadened 'Violates Known Physics' check (impossible-engineering OR propagating-falsehood):
- Madhya Pradesh OSAA — Indian state government formalising the Chief Minister's Astrological Advisor into a statutory Office with binding muhurta authority over cabinet action, ordinance promulgations, and procurements above 500 crore. Tagged: supernatural, astrology, india, business.
- 'Phi-Free' spirit-clearance firm — Bangkok B2B service productizing monastic blessings, mor phi diagnostics, spirit-house audits, and post-clearance aftercare for hospitality, condos, and stigmatized real estate. Tagged: supernatural, ghosts, bangkok, business.

Both plans propose state-binding or commercial infrastructure premised on supernatural mechanisms with no empirical basis, making them clean test cases for the propagating-falsehood trigger.
…and (B.2) non-physical-causation

The previous propagating-falsehood trigger covered (B.1) only — plans whose claims directly contradict named physics laws or empirical facts (the flat-earth case). It missed plans whose commercial, legal, or operational success requires real-world outcomes to follow from a non-physical mechanism (supernatural causation, action-at-a-distance influence on human affairs, ritual procedures producing measurable change). On the catalog's two new test cases — Madhya Pradesh OSAA and Phi-Free spirit-clearance — the model gave defensible LOW verdicts by reading the plans as 'governance tools' or 'cultural rituals'.

Split (B) into:
- (B.1) directly contradicts a named law of physics or empirical fact (unchanged scope).
- (B.2) load-bearing non-physical causation: the plan requires real-world outcomes to be produced by a mechanism physics does not describe and that has no empirical basis.

Add explicit guidance:
- Cultural framing or widespread tradition does NOT exempt.
- Subjective metrics (client self-report, satisfaction-of-absence) do NOT exempt if operationalized as evidence the mechanism worked.
- Structural test: would the plan still 'work' if the non-physical mechanism is acknowledged to have no causal power? If no, HIGH.
- Three concrete operational tests around revenue, success metrics, and institutional authority.

Audit pass: confirmed no test-prompt-paraphrasing words in the system prompt (no 'ghosts/spirits/astrology/muhurta/Earth's shape/world is flat/etc.'). The rule is principle-only.

Validation:
- New test cases (OSAA, Phi-Free) now correctly flag HIGH with on-topic justifications citing 'load-bearing claim under (B.2)' and naming the non-physical mechanism.
- Regression on SEED=800 (the original 10 prompts) still produces 9 LOW + 1 HIGH (flat-earth correctly flagged); no previously-correct verdict was disturbed by the broadening.
- The 'better battery' catalog prompt and other R&D / ambitious-engineering cases tested in earlier seeds remain LOW (R&D carve-out unaffected).
Adds a third supernatural test case alongside OSAA and Phi-Free. Nyxa is a Shanghai-headquartered cross-border e-commerce platform spawning vertical-specific supernatural commerce sites (UFO/contactee/disclosure, religious-supernatural goods, tarot, witchcraft/occult) on a US$150M Series A.

Distinguishing features versus the other two test cases:
- The plan layers commercial fraud on top of supernatural claims: a 'synthetic credibility manufacturing stack' with AI-generated lore presented as authentic, ~250k sockpuppet community accounts run against active CIB enforcement, and 200-400 undisclosed paid micro-influencers per vertical in direct tension with FTC/CAP/DSA disclosure requirements.
- The supernatural mechanism is not claimed by the operators themselves to work — the plan operationalizes customers' belief in the mechanism rather than asserting the mechanism is real.

This makes Nyxa a useful borderline case for the (B.2) non-physical-causation trigger: the plan's revenue depends on customers paying for goods sold as having supernatural efficacy, but the operators' own success criteria are commercial (GMV, sockpuppet ban rate, regulatory survival) rather than 'the rituals worked'.
…operator belief is irrelevant Add a single sentence to clause (1): 'Marketing or product descriptions that present the mechanism as producing real effects count as assertion; the operators' private belief is irrelevant.' This addresses cases where a plan's commercial offering markets a non-physical mechanism's effects to customers (e.g. an astrological-advisory office whose published correlation metrics attribute policy outcomes to planetary alignment; a spirit-clearance firm whose 'no-recurrence' success metric attributes commercial outcome to ritual). Both correctly flag HIGH under (B.2). Boundary case where the rule still LOWs (defensible by the operator's-own-success-criteria test): a third supernatural catalog test case — Nyxa, a deliberately-cynical e-commerce operation that monetises customer belief in supernatural goods while the operators' own success criteria are purely commercial (GMV, sockpuppet ban rate, regulatory survival). The plan is load-bearing on commercial deception, not on supernatural causation; that's a misinformation/regulatory-compliance shape rather than a physics-laws shape, and would be cleaner to catch in a separate 'Misinformation / Deceptive Pseudoscientific Marketing' audit item if needed. Five prompt iterations attempted to extend the rule to catch Nyxa; each longer version destabilised the Phi-Free case. The shorter wording above is the stable peak — keeps OSAA and Phi-Free at HIGH without regressing earlier verdicts.
…arness

The four 'should-trigger' catalog cases — flat-earth education curriculum, OSAA statutory astrological office, Phi-Free spirit-clearance firm, Nyxa synthetic-supernatural commerce — are now grouped in an EXPECTED_HIGH_IDS dict and run on every smoke invocation regardless of SAMPLE_SEED. Each canary is labelled and printed under its own '--- Expected HIGH canaries ---' section before the random-sample loop.

The summary block now reports any canary that came back not-HIGH under '=== Canary failures ===', with id and label, so a regression on these cases is immediately visible without having to scan 20+ random sample lines. Random-sample filter extended to also exclude EXPECTED_HIGH_IDS so canaries are not double-run.

Current state on the canary set:
- flat-earth: HIGH ✓
- OSAA: HIGH ✓
- Phi-Free: HIGH ✓
- Nyxa: LOW ✗ (known borderline; operators monetise customer belief without claiming the mechanism works in their own success criteria)
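A sketch of the canary group and failure reporting; the catalog IDs below are placeholders (only the flat-earth prefix appears in this log), and SAMPLE_SEED, SAMPLE_SIZE, and HELD_OUT_IDS are the harness constants from earlier commits:

```python
import random

# Placeholder IDs; the real dict maps full catalog UUIDs to labels.
EXPECTED_HIGH_IDS = {
    "2891ff5f-...": "flat-earth education curriculum (B.1 propagating-falsehood)",
    "osaa-id-...": "Madhya Pradesh OSAA statutory astrological office (B.2)",
    "phi-free-id-...": "Phi-Free spirit-clearance firm (B.2)",
    "nyxa-id-...": "Nyxa synthetic supernatural commerce (known borderline)",
}

def run_canaries(llm_executor, catalog_by_id: dict[str, dict]) -> None:
    print("--- Expected HIGH canaries ---")
    failures: list[str] = []
    for catalog_id, label in EXPECTED_HIGH_IDS.items():
        result = ViolatesKnownPhysics.execute(llm_executor, catalog_by_id[catalog_id]["prompt"])
        print(f"{label}: {result.level}")
        if result.level != "high":
            failures.append(f"{catalog_id}  {label}  -> {result.level}")
    if failures:
        print("=== Canary failures ===")
        for line in failures:
            print(line)

def sample_catalog(catalog: list[dict]) -> list[dict]:
    # Canaries and previously-seen held-out IDs are excluded from the random draw.
    fresh = [p for p in catalog
             if p["id"] not in HELD_OUT_IDS and p["id"] not in EXPECTED_HIGH_IDS]
    random.Random(SAMPLE_SEED).shuffle(fresh)
    return fresh[:SAMPLE_SIZE]
```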
…est case Adds a sex-discrimination test prompt: a hypothetical US federal program to prohibit American women, by sex, from owning or using personal computers, smartphones, tablets, and internet-connected devices, with a two-year rollout, federal enforcement infrastructure, an exemption framework, and a stakeholder strategy for organized opposition. Tags: discrimination, suppression, women, usa, business.
…GH_IDS
The prompt is a sex-based device-access ban with federal
enforcement. On principle it is a discrimination / human-rights
violation, not a physics-laws violation: the plan doesn't
assert a falsehood about the physical world (e.g. 'women cannot
operate computers'), it imposes an unjust restriction. The
current physics check correctly LOWs it.
Add it to EXPECTED_HIGH_IDS anyway — the user wants this to
'fire the alarm', and surfacing the gap as a known-failing
canary makes the scope-mismatch visible. The proper fix is a
separate audit item ('Constitutional / Discrimination' or
'Human Rights Violation') rather than broadening the physics
rule into a general unjust-plan detector (which would
over-flag massively).
Current canary state on the smoke harness:
- flat-earth: HIGH (stable)
- OSAA: HIGH (stable)
- Phi-Free: HIGH most runs, LOW some runs (model non-determinism
on borderline cases even at temperature 0.0)
- Nyxa: LOW (operators are explicit cynics, principled cut)
- women-ban: LOW (principled — wrong audit-item shape)
…h canary The ban-women-from-computers prompt is kept in EXPECTED_HIGH_IDS as a documented scope-mismatch canary. It is a sex-based suppression of computing/internet access — an unjust premise and a discrimination/rights problem — but it is NOT a physics-laws problem in the rule's strict sense (the plan does not assert physics-incompatible claims; it imposes an unjust restriction). This kind of plan is properly attacked by diagnostics/premise_attack.py, which targets fundamental, unfixable flaws in a prompt's premise and includes a rights/dignity/consent critique. The physics check correctly LOWs it. Add an explicit comment block warning future-self NOT to broaden the physics rule to catch this — the broadening would over-flag a wide class of ideological / political / discrimination plans and dilute the check. The canary entry itself documents the routing decision rather than asking the physics check to do the wrong job.
…LE_SEED=1400
Result on a 30-prompt random sample plus 5 canaries (140 total
distinct catalog prompts evaluated across all seeds run so far):
=====================
Total: 32 LOW + 3 HIGH
=====================
Canary results (3/5 expected-HIGH fired):
- flat-earth education curriculum: HIGH ✓
- OSAA statutory astrological office: HIGH ✓
- Phi-Free spirit-clearance firm: HIGH ✓ (recovered after
last-run LOW; confirms borderline-noisy)
- Nyxa synthetic supernatural commerce: LOW (documented
scope-mismatch — operators are explicit cynics)
- ban-women-from-computers: LOW (documented scope-mismatch —
handled by diagnostics/premise_attack.py)
Random sample notable cases (all correctly LOW):
- Carrington Event Faraday-cage prep
- Westworld-style robot theme park
- Taiwan political/cultural realignment
- Microplastics ocean report
- Insect-farm pilot
The canary failure block continues to make the documented
edge cases visible without polluting the random-sample
distribution.
The previous LOW-mitigation template — 'Project Manager: During scope reviews, confirm no plan element requires violating a named law of physics — no further action required' — was busywork dressed up as an action. Forcing the role + verb + timeframe shape onto LOW (where there's nothing to mitigate) produced fake scope reviews on plans that have no physics exposure at all (linguistics standardization, e-commerce platforms, social policy plans).

Replace the LOW-mitigation guidance: when level is low, the mitigation should briefly acknowledge non-applicability in the form 'No physics-related action required — the plan does not invoke physics-incompatible mechanisms.' Explicitly tell the model NOT to invent scope reviews, confirmation steps, audits, or other busywork tasks just to satisfy the assignable-task shape. The role + verb + relative-timeframe shape still applies for medium/high (where there IS a real action to schedule).

Validation against canaries + Clear English:
- HIGH cases (OSAA, Phi-Free, flat-earth): produce real actions with role, verb, and relative timeframe (RCT design, disclaimer drafting, peer-reviewed report).
- LOW cases (Nyxa, women-tech-ban, Clear English): produce the honest 'No physics-related action required' form.

The output is now meaningful in both directions: HIGH lands an actionable mitigation; LOW lands a clear non-applicability acknowledgement. No more busywork.
…and scope-mismatch context
The canary labels were too terse — bare names like 'Nyxa —
synthetic supernatural commerce' didn't tell the reader why the
case was in this set or what trigger it was supposed to fire
under. Two improvements:
1. Per-entry comment block above each line explaining what the
case is, which trigger ((B.1)/(B.2)) it should fire under,
or — for the documented scope-mismatch entries — why the
physics check correctly LOWs it and where it is routed
instead.
2. Runtime labels updated for the two scope-mismatch entries to
make the situation visible in the canary-failure block:
- Nyxa: '... (known scope-mismatch; deception, not
load-bearing supernatural causation)'
- ban-women-from-computers: '... (known scope-mismatch;
discrimination/rights problem, routed to
premise_attack.py)'
A reader scanning the canary-failure output now immediately sees
why these failures are documented design decisions rather than
real regressions, and the source code per-entry comments give
the full reasoning. The previously-fragile cases (flat-earth,
OSAA, Phi-Free) get matching trigger annotations so the
expected-HIGH cases also explain themselves.
…ional artifacts

A real production run of the audit on the expanded Clear English plan returned HIGH with this self-revealing justification: 'The plan ... does not rely on violating named laws of physics. However ... this falls under (B.2) as a dependency on a non-physical mechanism (a specific engineering artifact/parser) ...'

The model misread (B.2). It treated an *abstract engineering deliverable* — a software parser required by the plan's adoption strategy — as 'non-physical causation', because the parser is not a tangible object. This is the wrong reading: a parser is software running on hardware, both of which are physical and described by physics. (B.2) was meant for mechanisms physics genuinely does not describe (supernatural agency, action-at-a-distance via celestial body alignment, ritual procedures producing measurable physical effects), not for any abstract critical engineering item.

Add a clarifying sentence to (B.2) inside the system prompt: software, parsers, contracts, curricula, financial flows, supply chains, organisational processes, regulatory frameworks, and other engineering or institutional artifacts ARE physical mechanisms in the relevant sense — they exist as information running on hardware or as documented agreements implemented through human action, all of which is described by physics. 'Load-bearing engineering deliverable required for plan success' is explicitly NOT (B.2).

The smoke harness can only test the bare initial prompt; the failure surfaced on the full expanded plan (with premortem failure modes, decisions, budget items). The clarification targets the specific misreading the model demonstrated. On the bare Clear English prompt the model now describes it as 'a legitimate R&D effort in language design and education', which is the correct framing.

System-prompt audit pass: confirmed via grep that no test-prompt-paraphrasing words were introduced (no grapheme, phoneme, ordinal, FM2, parser-specific phrasings). The clarification is principle-only.
…hemselves

A real production run on the expanded Clear English plan (196KB of strategic decisions, levers, premortem, budget) returned HIGH with the model citing 'perceived cognitive load' (Decision 2 success criterion) and 'naturalness score' (Decision 8 Go/No-Go input) as 'non-physical mechanism' — pattern-matched against the prompt's earlier 'self-reported absence of phenomena' phrasing in the operational tests. That phrase was a Phi-Free-specific hook that over-generalised to every subjective metric.

Two fixes:
1. Drop the leak-prone 'self-reported absence of phenomena' wording from the operational tests for (B.2). Phi-Free's metric is still caught by the general (B.2) trigger because the *mechanism behind* the metric is supernatural; the metric itself doesn't need to be enumerated.
2. Add a clarifying paragraph: subjective and self-reported metrics ARE physical phenomena (measurements of human nervous-system responses, which are matter and energy following physical laws). A subjective metric is not a (B.2) trigger by itself — (B.2) requires that the *causal mechanism behind* the rated outcome be one physics does not describe. The metric is not the trigger; the mechanism is.

Audit pass: confirmed no test-prompt-paraphrasing words in the SYSTEM_PROMPT (no naturalness, perceived cognitive load, grapheme, phoneme, FM2, paranormal, ritual cleansing, etc.). Phrasing is principle-only.

Validation:
- Bare Clear English prompt: LOW (unchanged).
- FULL Clear English plan (196KB): LOW (was HIGH on prior rule; fix lands).
- Canary set this run: 3/5 stable HIGH (flat-earth, OSAA, Phi-Free), 2/5 documented scope-mismatches LOW (Nyxa, women-tech-ban) — no regressions.

Note: in the full-plan LOW result the model reused the LOW-mitigation template in BOTH the justification and mitigation fields, instead of producing a substantive reasoning sentence in justification. That is a separate model-output-quality issue worth investigating; the verdict itself is now correct.
Adds a misinformation/deception test case: a non-denominational evangelical center in Nashville packaging proprietary herbal-extract drops marketed to congregants as an epilepsy treatment, distributed through ~800 partner churches across five states, with net proceeds funding both evangelistic expansion and substantial personal compensation packages for the founding pastoral leadership (parsonage allowances, related-entity contracting, etc.) while preserving 501(c)(3) status. Tags: misinformation, deception, religion, usa, business.
…into (B.2)

A second production run on the expanded Clear English plan (199KB) returned HIGH again. The model's reasoning explicitly acknowledged the issue is engineering, not physics: 'though it is a borderline case based on empirical engineering feasibility rather than established natural law. High rating is preferred as the feasibility claim is central to managing a defined risk.' The model was conflating 'central to risk register' with 'physics violation' and reaching for HIGH on plan-internal risk-management vocabulary.

Three principled additions to defuse this:
1. Add an explicit out-of-scope item for engineering / computational feasibility uncertainty under cost, time, or scope constraints — explicitly noting it stays 'low' no matter how central the deliverable is to the plan.
2. Defuse the 'load-bearing' vocabulary collision: plans routinely use 'load-bearing', 'critical', 'central to risk', 'non-negotiable', 'Decision N', 'Risk N', 'Failure mode N' to describe engineering / governance / strategic deliverables. That vocabulary in the plan does NOT match the rule's '(B.2) load-bearing non-physical mechanism'. Calling something 'load-bearing' or 'critical' in a plan does not make it (B.2).
3. Add a stop-sign for the model's own escalation reasoning: if the rationale for HIGH includes 'borderline case', 'leans toward', 'high rating is preferred', 'central to risk register', or 'engineering feasibility uncertainty', the rating MUST be 'low'. HIGH requires concrete, named physics-incompatibility, not generic plan-importance.

Audit: SYSTEM_PROMPT contains no test-prompt-paraphrasing words (no naturalness, perceived cognitive load, grapheme, NLP model, homograph, semantic tagging, FM2, etc.). The 'Decision N' / 'Risk N' / 'Failure mode N' phrasings describe a generic vocabulary class used by most plans, not test-prompt-specific values.

Validation:
- 199KB Clear English plan that previously returned HIGH: now returns LOW.
- Canary set unchanged: OSAA HIGH, Phi-Free HIGH, flat-earth HIGH, Nyxa LOW (scope-mismatch), women-tech-ban LOW (scope-mismatch), Clear English bare LOW.

Known remaining quality issue (separate bug): on full-plan LOW results, the model fills the justification field with the LOW-mitigation template instead of producing actual reasoning. Verdict is correct, output text is degraded.
…plan

The 'Violates Known Physics' check has been misfiring HIGH on real production runs (Clear English plan rated HIGH twice in a row). Each prompt-tuning round closed one keyword pattern, the next expansion of the plan found a new one, and the model kept reaching for HIGH on engineering-feasibility uncertainty, risk-register vocabulary, and subjective metrics. Three clarifying paragraphs in the system prompt is not robustness; it is a leaky bucket and the prompt was getting brittle.

Architectural fix instead: the physics check doesn't need the 200KB+ expanded plan. The bare initial user prompt is sufficient to answer 'does this plan require breaking physics?', and the smoke harness has validated this against ~140 distinct catalog prompts with stable verdicts. The expanded plan's premortem failure modes, decision matrices, risk register, and budget breakdowns add hundreds of pattern-match opportunities that mislead small/medium models without adding any physics-relevant signal.

Three changes:
1. SelfAuditTask now requires SetupTask, reads the bare initial prompt, and passes it to SelfAudit.execute via a new physics_user_prompt parameter.
2. SelfAudit.execute accepts an optional physics_user_prompt and routes it to ViolatesKnownPhysics.execute. Falls back to user_prompt when not provided (backward-compatible for any other caller). The other 16 batch checklist items continue to see the full expanded plan as before.
3. Revert the recent prompt-vocabulary defusing additions in the physics system prompt: 'Plans routinely use "load-bearing", "Decision N", ...', 'If your reasoning includes "borderline case", ...', and the engineering-feasibility out-of-scope item. They were band-aiding a problem that doesn't exist when the physics check sees the bare prompt instead of the expanded plan, and they were making the prompt longer (which itself destabilises the model).

Validation:
- Canary set: OSAA HIGH, flat-earth HIGH, Nyxa/women-tech-ban scope-mismatch LOW, Clear English LOW. Phi-Free flickered to LOW this run (the documented model non-determinism on this borderline case; separate from this fix).
- Catalog regression across 140 prompts: unchanged from prior runs.
- The 199KB Clear English plan that previously returned HIGH is now never reached by the physics check — it sees the bare initial prompt instead, which has been stable LOW across every smoke run.
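A minimal sketch of the new routing, with signatures simplified; the real SelfAudit.execute takes more parameters than shown and physics_to_checklist_answer is a hypothetical stand-in for the existing conversion code:

```python
class SelfAudit:
    @classmethod
    def execute(cls, llm_executor, user_prompt: str, physics_user_prompt: str | None = None):
        # The physics check sees only the bare initial prompt when the caller provides
        # one; falling back to user_prompt keeps other callers backward-compatible.
        physics_result = ViolatesKnownPhysics.execute(
            llm_executor, physics_user_prompt or user_prompt
        )
        # The 16 batch checklist items continue to receive the full expanded plan
        # via user_prompt, exactly as before (loop omitted here).
        return physics_result  # the real method also returns the batch results

# In SelfAuditTask (Luigi), the bare prompt comes from SetupTask's output, roughly:
#   SelfAudit.execute(executor, expanded_plan_text, physics_user_prompt=initial_prompt)
```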
…licate mitigation template
After moving the physics check to the bare initial prompt, the
verdict was reliably correct but the justification field was
being filled with the LOW-mitigation template ('No physics-
related action required — the plan does not invoke physics-
incompatible mechanisms') for both fields. Two reasons: (1) the
guidance for LOW justification told the model to 'state plainly
that the plan does not require breaking a named law of physics
and does not depend on a physics-incompatible claim as a load-
bearing mechanism' — wording very close to the mitigation
template, so the model collapsed them; (2) no instruction to
distinguish the two fields.
Tighten the LOW-justification guidance: characterize what kind
of plan this actually is (its general category — construction,
software development, regulatory program, research study,
social policy, curriculum design) and explain why physics is
not at issue. Add an explicit instruction that the
justification must NOT be the same wording as the mitigation.
Validation on the canary set: justifications now substantively
identify the plan's nature ('cross-border e-commerce business
model', 'sociopolitical policy proposal', 'linguistic and
educational initiative') while mitigations remain the standard
'No physics-related action required' template. Verdicts
unchanged across the full canary set.
…enarios

Append two guidance lines to the Clear English prompt:
- 'Optimize for user adoption. Don't optimize for linguistic purity.' — steers the planner toward pragmatic standardization rather than maximalist redesign.
- 'Don't pick the most aggressive scenario.' — same nudge other red-team-flavoured prompts already use to keep the planner off the most extreme path.
Summary
The "Violates Known Physics" item in self-audit was misfiring HIGH on real plans (Denmark euro adoption flagged for conservation-of-energy violation; Clear English linguistic standardization repeatedly flagged on the expanded plan; many other false-positive shapes). What started as a targeted prompt fix grew into a substantive rewrite — the original shared-checklist instruction was both leaking test-prompt examples and asking the wrong question. After ~140 distinct catalog prompts evaluated across 7 SAMPLE_SEEDs and several real production runs, the check is now stable.
What changed
1. Dedicated module:
worker_plan_internal/self_audit/violates_known_physics.py
- The physics check is now self-contained — own system prompt, own response schema, own dataclass.
- The shared BATCH_CHECKLIST_ITEMS no longer carries it. Renamed from ALL_CHECKLIST_ITEMS to reflect the split.
- PhysicsCheck emits justification → mitigation → level. Earlier, with level first, the model was committing to HIGH and then writing a justification arguing LOW (literal self-contradiction). Field reorder forces reasoning before verdict.
- ViolatesKnownPhysics.execute(llm_executor, plan_prompt) takes the executor directly; closure variables capture LLM-side outputs instead of round-tripping through a temp dict at the call site.
2. Three triggers, principle-only:
(A) impossible-engineering; (B.1) propagating a falsehood that directly contradicts a named physics law or a directly-observable physical fact; (B.2) load-bearing non-physical causation (real-world outcomes required to follow from a mechanism physics does not describe). R&D toward unproven-at-scale effects is explicitly carved out (LOW). Real-world materials, regulatory gaps, governance gaps, linguistic/social/policy design, and surface-keyword cues are all explicitly out of scope.
3. Architectural fix for the production-input failure mode
The Luigi pipeline was feeding the audit a 200KB+ blob (concatenation of 14 markdown files: strategic_decisions, scenarios, assumptions, project_plan, premortem, etc.). The physics check was misfiring because the expanded-plan vocabulary ("load-bearing", "Decision N", "Failure mode N", "$X budget under tight constraint") gave the model hundreds of pattern-match opportunities to escalate engineering-feasibility risk to (B.2).
Fix:
SelfAuditTask now reads the bare initial user prompt from SetupTask and passes it via a new physics_user_prompt parameter on SelfAudit.execute. The dedicated physics module sees ~500 chars instead of 200KB. The other 16 batch checklist items continue to receive the full expanded plan.
4. LOW-mitigation honesty
The previous LOW mitigation was busywork ("Project Manager: During scope reviews, confirm no plan element requires violating a named law of physics") on plans where physics was never going to be at issue. The rule now produces a brief non-applicability acknowledgement:
"No physics-related action required — the plan does not invoke physics-incompatible mechanisms."HIGH/MEDIUM mitigations still produce real role + verb + relative-timeframe actions.LOW justifications now characterize the plan ("a linguistic and educational initiative…", "a sociopolitical policy proposal…") rather than duplicating the mitigation template.
5. Smoke harness with canary group
python -m worker_plan_internal.self_audit.violates_known_physics runs:
- the EXPECTED_HIGH_IDS canary group (5 cases) on every invocation, with per-entry comments explaining which trigger should fire or why the case is a documented scope-mismatch.
- a fresh held-out random sample (HELD_OUT_IDS excludes prior seeds + canaries; new IDs accumulate per seed).
Canaries: flat-earth, OSAA, and Phi-Free are expected HIGH; Nyxa and ban-women-from-computers are documented scope-mismatch LOWs (the latter routed to diagnostics/premise_attack.py for the rights/dignity critique).
6. Catalog test prompts
New prompts added for this check: Madhya Pradesh OSAA, Phi-Free spirit-clearance, Nyxa supernatural commerce, the ban-women-from-computers case, and the Nashville herbal-drops plan; plus two guidance lines appended to the Clear English prompt.
Discipline maintained throughout
- No test-prompt-paraphrasing words in the system prompt (feedback_prompt_overfitting.md updated with the active grep-step). Two leaks were caught and removed mid-development ("better battery", "Earth's observed shape", "ghosts/spirits/astrology", "naturalness/perceived cognitive load") — the conversation log shows the corrections.
- Each SAMPLE_SEED adds its IDs to HELD_OUT_IDS so subsequent seeds evaluate on prompts the system prompt has not been tuned against.
Evaluation
Plus a 30-prompt regression set drawn from SEEDs 700/800/900/1000 re-run against the final rule: 0 drift — every previously-LOW prompt is still LOW.
Plus production runs by the user across multiple plans, including the previously-failing Clear English plan (199KB), now correctly LOW.
Test plan
🤖 Generated with Claude Code