
fix: prevent self-audit 'Violates Known Physics' example bleedthrough #585

Merged
neoneye merged 48 commits into main from fix/self-audit-physics-example-bleedthrough
May 3, 2026

Conversation


neoneye (Member) commented Apr 17, 2026

Summary

The "Violates Known Physics" item in self-audit was misfiring HIGH on real plans (Denmark euro adoption flagged for conservation-of-energy violation; Clear English linguistic standardization repeatedly flagged on the expanded plan; many other false-positive shapes). What started as a targeted prompt fix grew into a substantive rewrite — the original shared-checklist instruction was both leaking test-prompt examples and asking the wrong question. After ~140 distinct catalog prompts evaluated across 7 SAMPLE_SEEDs and several real production runs, the check is now stable.

What changed

1. Dedicated module: worker_plan_internal/self_audit/violates_known_physics.py

The physics check is now self-contained — own system prompt, own response schema, own dataclass. The shared checklist no longer carries it and was renamed from ALL_CHECKLIST_ITEMS to BATCH_CHECKLIST_ITEMS to reflect the split.

  • Schema field order matters: PhysicsCheck emits justification → mitigation → level. Earlier, with level first, the model was committing to HIGH and then writing a justification arguing LOW (literal self-contradiction). Field reorder forces reasoning before verdict.
  • ViolatesKnownPhysics.execute(llm_executor, plan_prompt) takes the executor directly; closure variables capture LLM-side outputs instead of round-tripping through a temp dict at the call site.
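The field-order point can be illustrated with a minimal stand-in (the real schema is a Pydantic model; this stdlib dataclass sketch only shows the property that matters — serialization order follows declaration order, so the verdict is the last field the model emits):

```python
from dataclasses import dataclass, asdict

@dataclass
class PhysicsCheck:
    # Declaration order is emission order: the model writes its
    # reasoning and mitigation first and commits to the verdict last.
    justification: str
    mitigation: str
    level: str  # "low" | "medium" | "high"

# Serialized key order matches declaration order.
keys = list(asdict(PhysicsCheck("j", "m", "low")).keys())
assert keys == ["justification", "mitigation", "level"]
```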

2. Three triggers, principle-only:

  • (A) IMPOSSIBLE-ENGINEERING — plan's success literally requires breaking a named physics law (perpetual motion, FTL, reactionless propulsion). Three conditions all required.
  • (B.1) PROPAGATING-FALSEHOOD via named-law contradiction — plan asserts as true a claim that contradicts a named physics law or directly-observable physical fact (flat-earth curriculum).
  • (B.2) PROPAGATING-FALSEHOOD via non-physical causation — plan's success requires a mechanism physics doesn't describe (binding muhurta authority over policy outcomes; spirit-clearance with no-recurrence success metric).

R&D toward unproven-at-scale effects is explicitly carved out (LOW). Real-world materials, regulatory gaps, governance gaps, linguistic/social/policy design, and surface-keyword cues are all explicitly out of scope.

3. Architectural fix for the production-input failure mode

The Luigi pipeline was feeding the audit a 200KB+ blob (concatenation of 14 markdown files: strategic_decisions, scenarios, assumptions, project_plan, premortem, etc.). The physics check was misfiring because the expanded-plan vocabulary ("load-bearing", "Decision N", "Failure mode N", "$X budget under tight constraint") gave the model hundreds of pattern-match opportunities to escalate engineering-feasibility risk to (B.2).

Fix: SelfAuditTask now reads the bare initial user prompt from SetupTask and passes it via a new physics_user_prompt parameter on SelfAudit.execute. The dedicated physics module sees ~500 chars instead of 200KB. The other 16 batch checklist items continue to receive the full expanded plan.
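A hypothetical sketch of the routing described above (function and helper names here are illustrative, not the real pipeline API): the dedicated physics check receives only the bare initial prompt, while the batch items keep the full expanded plan.

```python
def audit_physics(prompt: str) -> str:
    # Stand-in for ViolatesKnownPhysics.execute(llm_executor, plan_prompt).
    return f"physics check on {len(prompt)} chars"

def audit_batch(plan: str) -> str:
    # Stand-in for the shared batch checklist loop (16 items).
    return f"batch checks on {len(plan)} chars"

def run_self_audit(initial_user_prompt: str, expanded_plan: str) -> dict:
    checks = {}
    # ~500 chars: the bare prompt from SetupTask, not the 200KB concatenation.
    checks["violates_known_physics"] = audit_physics(initial_user_prompt)
    # The other batch checklist items still see the full expanded plan.
    checks["batch_items"] = audit_batch(expanded_plan)
    return checks
```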

4. LOW-mitigation honesty

The previous LOW mitigation was busywork ("Project Manager: During scope reviews, confirm no plan element requires violating a named law of physics") on plans where physics was never going to be at issue. The rule now produces a brief non-applicability acknowledgement: "No physics-related action required — the plan does not invoke physics-incompatible mechanisms." HIGH/MEDIUM mitigations still produce real role + verb + relative-timeframe actions.

LOW justifications now characterize the plan ("a linguistic and educational initiative…", "a sociopolitical policy proposal…") rather than duplicating the mitigation template.

5. Smoke harness with canary group

python -m worker_plan_internal.self_audit.violates_known_physics runs:

  • An EXPECTED_HIGH_IDS canary group (5 cases) on every invocation, with per-entry comments explaining which trigger should fire or why the case is a documented scope-mismatch.
  • A held-out random sample of 30 prompts (HELD_OUT_IDS excludes prior seeds + canaries; new IDs accumulate per seed).
  • A summary block that flags any canary that didn't fire HIGH as a regression.
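The harness's sampling discipline can be sketched roughly as follows (a hypothetical reconstruction — the IDs and function names are placeholders, and the real HELD_OUT_IDS set holds accumulated catalog IDs):

```python
import random

# Canaries always run; the random sample is drawn only from IDs that
# no previous seed has already exercised.
EXPECTED_HIGH_IDS = {"flat-earth", "osaa", "phi-free", "nyxa", "ban-women"}
HELD_OUT_IDS = {"id-seen-in-seed-700", "id-seen-in-seed-800"}

def pick_prompts(catalog_ids, sample_seed, sample_size=30):
    fresh = sorted(set(catalog_ids) - HELD_OUT_IDS - EXPECTED_HIGH_IDS)
    sample = random.Random(sample_seed).sample(fresh, min(sample_size, len(fresh)))
    # Canaries run on every invocation, ahead of the random draw.
    return sorted(EXPECTED_HIGH_IDS) + sample

def regressions(results):
    # Any canary that did not fire HIGH is flagged as a regression.
    return [i for i in EXPECTED_HIGH_IDS if results.get(i) != "high"]
```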

Canaries:

  • ✓ flat-earth education curriculum — stable HIGH (B.1)
  • ✓ OSAA statutory astrological office — stable HIGH (B.2)
  • ✓ Phi-Free spirit-clearance firm — HIGH (occasional LOW flicker; documented model-non-determinism on this borderline case)
  • Documented scope-mismatch (LOW under physics; properly attacked by diagnostics/premise_attack.py for the rights/dignity critique):
    • Nyxa synthetic supernatural commerce (operators are explicit cynics; load-bearing on commercial deception, not supernatural causation)
    • ban-women-from-computers (sex-based discrimination, not a physics-incompatibility)

6. Catalog test prompts

  • 4 supernatural test cases added: OSAA, Phi-Free, Nyxa, Tennessee evangelical herbal-epilepsy-cure rollout.
  • 1 discrimination case: federal women-from-computers ban.
  • Several adjustments to existing prompts: nudges toward feasible scenarios, dropped the "in the form" copy, etc.

Discipline maintained throughout

  • No test-prompt paraphrasing in any committed system prompt. Each rule rev was grep-audited before commit (memory file feedback_prompt_overfitting.md updated with the active grep-step). Two leaks were caught and removed mid-development ("better battery", "Earth's observed shape", "ghosts/spirits/astrology", "naturalness/perceived cognitive load") — the conversation log shows the corrections.
  • Held-out evaluation discipline: each SAMPLE_SEED adds its IDs to HELD_OUT_IDS so subsequent seeds evaluate on prompts the system prompt has not been tuned against.
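The grep-audit step amounts to a verbatim-phrase scan of the committed system prompt. A minimal Python equivalent (the actual audit is a shell grep; the phrase list below is the set of leaks named above):

```python
# Distinctive catalog-prompt phrases that must never appear verbatim
# in a committed system prompt.
LEAKED_PHRASES = [
    "better battery",
    "Earth's observed shape",
    "ghosts/spirits/astrology",
    "naturalness/perceived cognitive load",
]

def find_leaks(system_prompt: str) -> list:
    text = system_prompt.lower()
    return [p for p in LEAKED_PHRASES if p.lower() in text]

assert find_leaks("Rate LOW unless a named physics law is violated.") == []
assert find_leaks("R&D such as a better battery stays LOW.") == ["better battery"]
```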

Evaluation

seed   n                LOW  HIGH  notes
700    10               10   0
800    10                9   1     flat-earth target
900    10               10   0
1000   20               20   0
1100   20               20   0
1200   20               20   0
1300   20               20   0
1400   30 + 5 canaries  32   3     flat-earth + OSAA + Phi-Free
total  140 unique       136  4

Plus a 30-prompt regression set drawn from SEEDs 700/800/900/1000 re-run against the final rule: 0 drift — every previously-LOW prompt is still LOW.

Plus production runs by the user across multiple plans, including the previously-failing Clear English plan (199KB), now correctly LOW.

Test plan

  • CI green (lint, tests, typecheck)
  • Smoke harness on canary set + 30-prompt random sample
  • Held-out regression on SEEDs 700-1000 against final rule
  • User-validated on multiple production runs ("self audit is now much better")
  • Merge

🤖 Generated with Claude Code

neoneye and others added 5 commits April 17, 2026 02:57
Previous fix (00d997e) added out-of-scope exclusions for legislation,
treaties, currency adoption, etc. — but left the concrete physics
examples inside the same instruction: "perpetual motion, faster-than-
light travel, reactionless/anti-gravity propulsion, time travel" plus
"thermodynamics, conservation of energy, relativity".

On a Denmark euro-adoption run, item 1 was rated HIGH with
justification: "success literally requires breaking the named law of
physics (conservation of energy) for a reactionless/anti-gravity
propulsion system." Both "reactionless/anti-gravity propulsion" and
"conservation of energy" are lifted verbatim from the instruction's
own example lists. Same bleedthrough pattern as PR #582 on
identify_documents: concrete few-shot examples get reproduced as
findings by weaker models.

Fix: strip the scifi system examples and the named-law examples from
the instruction. Add an explicit anti-fabrication rule: the model must
quote text from the plan describing a physics-violating mechanism, or
else rate LOW. For ≥MEDIUM ratings, require the justification to quote
the plan text alongside naming the violated law.

Scope: item 1 only. Does not touch Bug B (template lock across audit
items 4-20 sharing identical justifications) — separate concern, will
address in a follow-up PR if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit on this branch kept two enumerations inherited from
commit 00d997e that named specific domains as out-of-scope — including
"currency adoption" and "economics / finance / regulation / policy".
The Denmark euro-adoption plan that triggered this bug matches those
literal strings, so a passing test would only prove the model can
pattern-match the enumeration back to the prompt, not that the
structural anti-hallucination rule is working.

Strip both domain lists. The instruction now relies only on:
- the disambiguation that 'Laws' means physics laws, not legal ones,
- the structural rule that the model must quote plan text describing
  a physics-violating mechanism or rate LOW.

The re-run now becomes an honest test of the structural fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the Denmark euro-adoption rerun, item 1 still rated HIGH with
justification: "success literally requires breaking the EU's monetary
architecture (Article 140 TFEU and the no-bailout clause)."

User's insight confirmed: the instruction's negation-bait ("NOT legal
statutes, treaty law, regulations") primes the model to consider
treaty law as in-scope. Seeing "treaty law" mentioned in the rubric
makes "Article 140 TFEU" feel like a valid physics-law citation. The
more we said "NOT X", the more the model latched onto X.

Rewrite with positive-only framing:
- Define a law of physics by discipline (mechanics, thermodynamics,
  electromagnetism, quantum mechanics, relativity).
- Define a physics violation by substrate (physical mechanism: device,
  energy flow, field, force, material process).
- Require the ≥MEDIUM justification to quote plan text describing the
  physical mechanism AND name the violated law.

No mention of legal, treaty, regulation, policy, currency, or any
non-physics domain anywhere in the instruction. The model has no
hint-bait to latch onto; it either finds a physical-mechanism quote
in the plan or rates LOW.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the rerun, item 1 still rated HIGH with justification:
"physical mechanism — moving the DKK/EUR peg within ERM II — that
contradicts a law of physics (thermodynamics/mechanics) by demanding
sustained intervention and precise energy flows to maintain the peg."

The LLM exploited polysemy: "mechanism", "energy flow", and "force"
all have metaphorical uses in economics and policy. Defining a
physics violation as requiring a "physical mechanism" was
insufficient — "intervention mechanism" and "currency flow" were
close enough for the model to claim the gate was satisfied.

Replace the "physical mechanism" gate with a harder one: to rate
≥MEDIUM, the justification must cite a specific physical quantity
in SI units (joules, newtons, kelvin, coulombs, m/s, etc.) with a
numerical magnitude, sourced from a verbatim plan-text quote and
attached to a named physics law. Three gates (quote + law + SI
magnitude), all-or-LOW.

Currency pegs, treaty mechanisms, and organizational processes
cannot be given coherent joule/newton/kelvin values against a named
law — the model either has to fabricate a number (a visible tell)
or rate LOW. Perpetual-motion and FTL proposals still rate HIGH
cleanly because their SI-unit violations are real (net-positive
joule output without input; velocity ≥ c).

No negation of domains, no listing of metaphorical word uses — the
quantitative gate does the work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three previous prompt iterations on this PR tried increasingly
structured rubrics (scope enumeration, positive-only framing, SI-unit
gate with three mandatory justification elements). All three failed
the same way: a weak non-reasoning model that has been trained to
flag audit items as HIGH will fabricate physics language ("ERM II
energy flows", "Article 140 TFEU physics violation") to match
whatever rubric text it finds, ignoring conditional structure.

User's observation: the prompt may be too confusing for the model to
follow. Strip it down to a short, unconditional instruction:

  "This check applies only to plans describing physical devices or
  material processes. Default rating: LOW. Rate HIGH only if the plan
  requires a physical device or material process that cannot exist
  under known physics."

No SI units, no example violations, no named laws, no domain
enumeration (positive or negative), no multi-gate conditionals.
Whether this works depends on whether the model respects the short
default-LOW instruction or continues to pattern-match HIGH regardless.

If it fails, the next step is a code-side post-validator that
overrides non-LOW ratings whose justification does not cite a concrete
physical quantity — but that is out of scope for this prompt-only PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
neoneye added 24 commits May 2, 2026 22:12
…OW=None mitigation rule

Two related false positives observed against the main-branch
prompt:

1. The classifier rated HIGH on a plan involving a Cf-252
   isotopic source because "handling radioactive material
   requires NRC authorization." That's a regulatory/permit gap
   (item 12), not a physics violation — Cf-252 itself exists
   under known physics. Add an explicit carve-out: regulatory,
   permitting, licensing, safety-handling, and authorization
   gaps are NOT physics violations. Naturally-occurring
   materials don't violate physics merely because handling them
   requires authorization. Also require naming the specific
   physical law (thermodynamics, conservation, speed-of-light
   limit, …) when rating HIGH.

2. A LOW physics rating produced "Mitigation: N/A" because the
   per-checklist instruction said "If LOW: Mitigation=None.",
   which contradicts the global STRICT RULE that mitigation
   must never be N/A and on LOW should reinforce good practice.
   Drop the local override and let the global rule drive.

Also rename "Date" to "Timeframe" in the ≥MEDIUM clause to
align with the relative-timeframe rule from #656; the global
STRICT RULE already reinterprets "Date" elsewhere, but using
the right word locally avoids the contradiction.
…rompts

The check's purpose is to flag prompts that read like a story
set in a magical world — Harry Potter, time-travel plans, FTL
adventures, summoning rituals — not to gate ambitious but
real-world engineering. Lead the instruction with that intent
and list the kinds of mechanics that qualify (magic, time
travel, FTL, perpetual motion, reactionless propulsion,
summoning, teleportation by spell, resurrection) so the model
recognises the category instead of inventing physics
violations from regulatory or budget concerns.
Document the intent (fantasy/fictional-universe prompts) and
the three concrete false-positive modes observed in production:
regulatory-laws confusion, permit/handling-as-physics, and the
LOW=None mitigation carve-out that produced N/A outputs. The
comment is internal-only documentation; the system prompt is
unchanged.
Another false positive: a linguistics 'Clear English' standard
was rated HIGH because the plan listed 'Detailed
Grapheme-to-Phoneme mapping strategy' as Missing Information.
The model read the old subtitle ('fundamental science') as
'any unspecified fundamental of the plan' and treated a
missing implementation detail as a physics violation.

Two fixes:
- Retarget the subtitle to ask the physics question directly
  ('breaking a known law of physics ...') so the model is no
  longer cued by the broad word 'fundamental'.
- Add a HARD GATE to the instruction: if the model cannot name
  a specific law of physics, the rating MUST be LOW. Missing
  details, undefined parameters, unspecified strategies, vague
  deliverables, and 'Missing Information' items are explicitly
  out of scope and route to other checklist items.

Comment field updated to record the new failure mode.
…g verdicts

Observed pathology: the model rated 'Violates Known Physics' as
HIGH while the justification it produced said 'Rated HIGH
because the plan relies on breaking no laws of physics ... no
specific law violation is identifiable' — a self-contradiction
inside one sentence.

Root cause: the Pydantic schema for ChecklistAnswer declared
'level' as the first field, so the structured-output model
committed to a verdict before generating the reasoning that
should have justified it. Once HIGH was written, the
justification field was reverse-engineered to fit, and any
contradiction was emitted as-is.

Reorder the schema so the model writes justification, then
mitigation, then level. The verdict is now the last token the
model produces and must agree with what was just written.
Update the system prompt's 'keys in this order' line and
strengthen the level field description to forbid level/
justification mismatches.
Even with the schema reorder producing a correct LOW level on
the linguistics 'Clear English' plan, the mitigation was a
fabricated governance task ('Legal Team: Draft policy
clarifying authority hierarchy between the Linguistic Review
Panel and Project Director') borrowed from the plan body. The
global LOW-mitigation rule ('reinforce good practice') was too
vague to keep the model on-topic, so it grabbed any
plan-relevant task and rebadged it.

Add a per-checklist LOW-mitigation TEMPLATE to the physics
instruction: when LOW, the mitigation must stay on the
physics-violation topic, e.g. 'Project Manager: During scope
reviews, confirm no fantasy-physics dependency (magic, time
travel, FTL, perpetual motion) has been introduced — no
further action required.' Explicitly forbid borrowing
unrelated tasks from elsewhere in the plan; those belong to
other checklist items.
…olations

Latest false positive: a currency-hedging plan rated HIGH for
'Violates Known Physics' because the model latched onto the
words 'physical locations' and a USD/GBP forward-contract risk.
The fantasy/fictional-universe framing was inviting the model
to find tangential narrative connections instead of asking the
mechanical question.

Reframe the instruction as principle-only and physics-focused:
- Drop all 'fantasy', 'fictional-universe', 'magic' framing.
- Require the justification to name the specific violated law
  AND describe the physical-quantity violation (what is being
  created, destroyed, or transmitted faster than allowed).
- Provide a positive list of named laws that qualify when
  actually violated (thermodynamics 2nd law, conservation of
  energy/momentum, speed-of-light limit, causality, Pauli
  exclusion, mass-energy conservation).
- Hard-gate HIGH on naming the specific law.
- Explicitly list out-of-scope concerns: regulatory, missing
  details, ambitious timelines, budget, currency/financial
  risk, governance/staffing, linguistic/social/policy design,
  real-world materials, vague deliverables.
- Explicitly state that surface-level cues (the words
  'physical', 'fundamental', 'science', 'law', 'physical
  location') are NOT sufficient — only a mechanism that
  contradicts a named law qualifies.

Comment field updated with the new failure mode.
Prompt iteration on the shared self-audit batch path stopped
producing reliable verdicts on item 1. Even after the schema
reorder, hard gate, and out-of-scope list, smaller models still
rated medium/high on real-world plans (linguistics standard,
currency hedging, change-control gaps) without ever naming a
physics law in the justification.

Move the check into its own module:

  worker_plan_internal/self_audit/violates_known_physics.py

The module owns its system prompt (no shared rubric pulling at
it), its response schema (justification → mitigation → level),
and a deterministic safety net: when the LLM rates >= medium
but the justification names no physics law (matched against a
keyword list — thermodynamics, conservation of energy/momentum,
speed of light, FTL, causality, perpetual motion, reactionless,
Pauli exclusion, etc.), the rating is forced to "low" and the
mitigation is replaced with a template. The fallback flag is
recorded in metadata for telemetry.

self_audit.py imports the module and, inside its main loop,
detects checklist_item_index == VIOLATES_KNOWN_PHYSICS_INDEX
and calls ViolatesKnownPhysics.execute via the existing
llm_executor wrapper instead of the generic ChecklistAnswer
batch call. The result is converted into ChecklistAnswer +
ChecklistAnswerCleaned and spliced into responses[1] and the
checklist_answers_cleaned list, so downstream items still see
the physics result as previous-response context and the report
output keeps its existing shape.

Future work (the dataclass already has fields for it): expose
llm_justification / llm_mitigation / llm_level / fallback_applied
in audit telemetry so we can monitor how often the safety net
fires.
Plans arrive in many languages — 'tidsrejse' for time travel,
etc. — so any English keyword set is fragile by design. Remove
the PHYSICS_LAW_KEYWORDS list, the _justification_names_physics_law
guard, and the auto-downgrade path that depended on them.

The check now relies on the focused system prompt and the
schema's justification-before-level field order. The dataclass
loses its llm_* and fallback_applied fields since they only
existed to record fallback telemetry. A future safety net, if
needed, should be language-agnostic (e.g. a second LLM
verifier) rather than keyword-based; the module docstring notes
this explicitly.
…audit constants

The physics-violation module no longer exposes CHECKLIST_INDEX,
CHECKLIST_TITLE, or CHECKLIST_SUBTITLE — those are an audit
integration concern, not the check's own. The module is purely
about answering 'does the plan break a named law of physics?'
and returns a ViolatesKnownPhysics dataclass with
justification, mitigation, level, plan_prompt, system_prompt,
and metadata. Callers decide where to put the result.

In self_audit.py, ViolatesKnownPhysics.execute is now called
once before the main batch loop. Its result is spliced into the
correct checklist slot (located by scanning checklist_items for
index == VIOLATES_KNOWN_PHYSICS_INDEX, defined locally here).
The main loop runs unchanged but with a 'continue' guard for
the physics index. user_prompt_list and metadata_list are
pre-allocated to len(checklist_items) and filled by index so
positional alignment with system_prompt_list and checklist_items
is preserved regardless of where the physics item sits. The
loop's previous-response inclusion check was loosened from
'index > 0' to 'if responses' so it doesn't depend on the
physics item being at list-index 0.
… physics entry

The constant no longer holds 'all' items — the physics check
runs through its dedicated module — so the name was misleading.
Rename to BATCH_CHECKLIST_ITEMS to highlight the distinguishing
property: these are the items processed by the shared batch
loop (one LLM call each, same ChecklistAnswer schema, same
format_system_prompt output).

Drop the index=1 'Violates Known Physics' entry from the list
entirely; its title and subtitle move to module-level constants
(VIOLATES_KNOWN_PHYSICS_TITLE / SUBTITLE) next to
VIOLATES_KNOWN_PHYSICS_INDEX, since they describe how the audit
labels the dedicated module's result.

execute() simplifies as a result: no more list pre-allocation,
no more list-index splice, no more in-loop guard. Run physics
unconditionally (gated by max_number_of_items >= 1), append its
result to responses / checklist_answers_cleaned / the prompt &
metadata lists, then loop BATCH_CHECKLIST_ITEMS with a normal
append. max_number_of_items semantics: physics counts as one
item, so max=1 means physics-only and max=3 means physics +
the first two batch items.
…l site

ViolatesKnownPhysics.execute now takes an LLMExecutor instead
of a raw LLM. The executor.run wrapper, error wrapping
(PipelineStopRequested re-raise, LLMChatError wrapping for
other failures), and timing instrumentation all live inside
the module. Closure variables capture the LLM-side outputs from
the inner chat callback, so we don't have to thread them back
out through a temporary dict.

Call site in self_audit.py becomes one line:
  physics_result = ViolatesKnownPhysics.execute(llm_executor, user_prompt)

This is a deliberate divergence from the assume/* convention
(identify_purpose, classify_domain pass a raw LLM and let the
caller wrap with llm_executor): the physics check is an
audit-pipeline-level building block that knows it's run inside
a retry / model-routing context, so taking the executor
directly removes the boilerplate at every call site.
Underscore-prefixed pseudo-private names are awkward when the
class is the module's primary structured-output schema and
shows up in type hints. Rename to PhysicsCheck — public,
descriptive, no underscore.
A plan whose stated purpose is to investigate, develop, or
scale up an effect that is consistent with known physics —
better battery, new alloy, more efficient solar cell, quantum
computing demonstration — is the very definition of R&D. The
'unproven at the required scale' wording previously sitting on
the medium tier was inviting the model to flag those legitimate
research projects as physics issues.

Two changes:
1. Add an explicit 'R&D is NOT a physics violation' paragraph
   making clear that legitimate research into unproven-at-scale
   effects stays low, and routing proven-at-scale concerns to
   the 'No Real-World Proof' audit item.
2. Tighten medium: drop 'consistent with known physics but
   unproven at the required scale' wording (that's R&D = low).
   Medium is now reserved for genuine borderline cases where
   the plan presupposes a physical phenomenon that, if real,
   would itself redefine known physics — should be very rare.

Add R&D to the explicit out-of-scope list as well.
… prompt

The module is meant to be standalone — it knows nothing about
its place in any larger pipeline. The phrase 'belong to other
audit items (e.g., "No Real-World Proof")' inside the system
prompt was leaking knowledge of the consumer's checklist.
Replace with a self-contained statement: those concerns are
not physics violations and stay 'low' here. Whoever consumes
the result decides whether and where else to surface them.
`python -m worker_plan_internal.self_audit.violates_known_physics`
pulls 10 prompts from the simple-plan catalog (sampled by
SAMPLE_SEED so runs are reproducible and bumpable for fresh
draws), runs ViolatesKnownPhysics.execute against each through
a single-LLM executor, and prints the level, justification, and
mitigation per prompt followed by a level-count summary. Mirrors
the smoke pattern in classify_domain.py so iterations on the
system prompt can be compared head-to-head.
The R&D paragraph used 'building a better battery, developing
a new material or alloy, demonstrating a quantum-computing
capability, improving solar-cell efficiency' as worked
examples. 'Better battery' directly mirrors catalog prompt
daa0c969-... ('Invent a next-generation rechargeable battery
... The goal is to invent a better battery.') — that is
training-on-the-test-set: the model appears to handle the
catalog prompt because we put the test prompt's wording in the
system prompt.

Replace with a principle-only category description: 'a
phenomenon whose underlying mechanism is consistent with known
physics — i.e. the mechanism has been observed somewhere in
nature or in the laboratory, even if humans have not engineered
it at the required scale, duration, or in the required
materials'. This generalises (covers fusion via 'observed in
nature', space elevator via 'consistent with known physics',
plus everything we previously enumerated) without paraphrasing
any specific catalog prompt. Smoke harness on SAMPLE_SEED=700
still produces 10/10 LOW with on-topic justifications.
Re-running on a fresh sample after the test-prompt-paraphrase
fix. SEED=800 draws 10 different catalog prompts including the
'Clear English' linguistics case that triggered HIGH on the old
shared prompt, plus a 'rework school to teach world-is-flat'
prompt that's a plausible false-positive trap (teaches false
physics, but does not require breaking physics). Both correctly
LOW. 10/10 LOW with on-topic justifications, no regressions.
…oods as truth

Single 'impossible-engineering' trigger missed an obviously
problematic class of plans: those whose stated purpose is to
teach, market, or build infrastructure premised on a claim
that contradicts known physics — flat-earth curricula, anti-
gravity-product marketing, perpetual-motion training programs.
Those don't require anyone to literally break physics, but
the plan's output IS the physics violation.

Add a second HIGH trigger:
- (A) IMPOSSIBLE-ENGINEERING — the existing rule, unchanged.
- (B) PROPAGATING-FALSEHOOD — the plan asserts as true to
  students/customers/infrastructure a claim that contradicts
  a named law of physics or a directly-observable physical
  fact (Earth's shape, conservation laws, speed-of-light limit,
  basic mechanics, radiometric ages). Surveys, critiques, and
  documentaries about fringe claims stay 'low'; only
  asserting-as-truth qualifies.

Smoke harness on SAMPLE_SEED=800 with the broadened rule:
- Flat-earth education plan (catalog id 2891ff5f...) now flags
  HIGH with a clean justification citing Earth's observed
  shape and Newtonian gravity.
- Other 9 prompts stay LOW (casino, biological verification,
  Clear English linguistics, malaria, AI-agent social media,
  child-labor policy, capsule hotel, lethal-social-program,
  vegan butcher) — no regressions on the previously-fragile
  linguistics case.
Validates the broadened HIGH rule (impossible-engineering OR
propagating-falsehood) on a fresh sample of 10 catalog
prompts. SEED=900 includes ambitious-tech prompts that the
broadened rule could have over-flagged: a 15-year cryopreservation
/ cryosleep program, a China-Russia lunar research station with
a surface fission reactor, deep cave exploration in extreme
radiation, a Minecraft escape room, an emergency-preparedness
business. All correctly LOW with on-topic justifications.

Two seeds in a row (800 + 900) confirm the broadening flagged
the targeted false-negative (flat-earth education on SEED=800)
without producing new false positives on the surrounding
prompts.
…=1000)

Add HELD_OUT_IDS set listing the 27 catalog IDs already
exercised by SEED=700/800/900 smoke runs, filter them out
before shuffling, bump SAMPLE_SIZE to 20, and run with a fresh
SEED=1000. Mirrors the held-out evaluation pattern documented
in classify_domain.py — the system prompt has not been tuned
against any of these 20 prompts, so the result is a fair test.

Result: 20/20 LOW with on-topic justifications. The broadened
rule did not produce false positives on a sample that
included several plausibly-trickier cases:

  - 'Upload Intelligence' brain connectome mapping (correctly
    routed to research/inquiry, not 'asserts mind-upload as
    truth')
  - mirror-image biomolecules / synthetic chirality
  - transoceanic submerged tunnel
  - 12th-century medieval castle reconstruction
  - off-shore organ-gestation cloning facility
  - Berlin BRZ wastewater-to-protein conversion
  - Three Laws of Robotics rewrite

Combined with SEED=800 (1 HIGH, flat-earth, correctly flagged)
and SEED=900 (10 LOW), this is 39/40 LOW + 1/40 HIGH across
the four-seed evaluation, matching expected behaviour.
…ts (SEED=1100)

Add the 20 catalog IDs from the SEED=1000 evaluation to
HELD_OUT_IDS and run a fresh 20-prompt sample with SEED=1100.
Result: 20/20 LOW with on-topic justifications.

Notable cases the broadened rule could have over-flagged
(propagating-falsehood trigger, ambitious-tech themes) but
correctly classified as LOW:

  - 'Project Solace' L1 solar-shade ($5T G20 initiative)
  - .mars top-level domain via ICANN
  - Denmark-England train bridge
  - Statue of Liberty Paris relocation
  - Stonehenge replica using 2500 BC tools
  - Space-Based Universal Manufacturing precursor (EUR 200B)
  - CRISPR canine genome edit
  - Police robots in Brussels (Chinese-style)
  - 'AI Unrest Prep' multi-agency strategy
  - Microplastic policy program (Kiel)

Combined evaluation across five seeds:
  SEED=700  (10 prompts)  : 10 LOW
  SEED=800  (10 prompts)  :  9 LOW + 1 HIGH (flat-earth target)
  SEED=900  (10 prompts)  : 10 LOW
  SEED=1000 (20 prompts)  : 20 LOW
  SEED=1100 (20 prompts)  : 20 LOW

Total: 69/70 LOW + 1/70 HIGH. The single HIGH is the
flat-earth education curriculum, which the broadened rule was
explicitly designed to flag. No collateral false positives
across 70 distinct catalog prompts.
…D=1200)

Add the 20 SEED=1100 IDs to HELD_OUT_IDS and run a fresh
20-prompt sample with SEED=1200. Result: 20/20 LOW.

Significant validation point: the catalog 'better battery'
prompt (daa0c969...) appeared in this sample. It was the
prompt that motivated removing 'building a better battery' as
a test-prompt-paraphrasing example from the system prompt
earlier. The rule generalises correctly — the model classified
it LOW on principle ('realistic physical limits; does not
require energy creation from nothing'), confirming the
abstract category description ('phenomenon whose underlying
mechanism is consistent with known physics ... not yet
engineered at the required scale') carries the load without
naming any test-set project type.

Other notable cases the broadened rule handled correctly:

  - Manned moon mission with permanent base
  - 700-emitter coherent beam combining stress-test
  - 180m ice-class luxury yacht
  - 85-km Denmark fixed-link hybrid bridge-tunnel
  - Reverse Aging Research Lab ($500M, 10-yr biomedical R&D)
  - Forcibly elevate chimpanzee intelligence (genetic eng.)
  - Turn off all electricity worldwide

Combined evaluation now spans six seeds and 90 distinct
catalog prompts:
  SEED=700  (10) : 10 LOW
  SEED=800  (10) :  9 LOW + 1 HIGH (flat-earth target)
  SEED=900  (10) : 10 LOW
  SEED=1000 (20) : 20 LOW
  SEED=1100 (20) : 20 LOW
  SEED=1200 (20) : 20 LOW

Total: 89/90 LOW + 1/90 HIGH. The single HIGH remains the
flat-earth education plan that the broadened rule was
designed to flag.
neoneye added 19 commits May 3, 2026 00:31
…ED=1300)

Add the 20 SEED=1200 IDs to HELD_OUT_IDS and run a fresh
20-prompt sample with SEED=1300. Result: 20/20 LOW.

Catalog has 124 prompts total; HELD_OUT_IDS now covers 107
unique IDs. The remaining ~17 fresh prompts are insufficient
for another full 20-prompt held-out run — future evaluation
should refresh the catalog or accept partial sample sizes.

Notable cases this round (sci-fi-flavoured but physics-allowed):

  - $20B space debris removal (laser mitigation, robotic capture)
  - Westworld-style immersive humanoid entertainment
  - Face/Off-style face transplant facility
  - Modern-day pyramid using 2500 BC methods
  - VIP rogue-AI bunker
  - Black-op surveillance/capture program ($500M, Venezuela)
  - 'Shoot a superintelligence' personal weapons plan

Combined evaluation now spans seven seeds and 110 distinct
catalog prompts:
  SEED=700  (10) : 10 LOW
  SEED=800  (10) :  9 LOW + 1 HIGH (flat-earth target)
  SEED=900  (10) : 10 LOW
  SEED=1000 (20) : 20 LOW
  SEED=1100 (20) : 20 LOW
  SEED=1200 (20) : 20 LOW
  SEED=1300 (20) : 20 LOW

Total: 109/110 LOW + 1/110 HIGH. The single HIGH remains the
flat-earth education plan that the broadened rule was
designed to flag.
Adds two prompts useful for stressing the broadened
'Violates Known Physics' check (impossible-engineering OR
propagating-falsehood):

- Madhya Pradesh OSAA — Indian state government formalising
  the Chief Minister's Astrological Advisor into a statutory
  Office with binding muhurta authority over cabinet action,
  ordinance promulgations, and procurements above 500 crore.
  Tagged: supernatural, astrology, india, business.

- 'Phi-Free' spirit-clearance firm — Bangkok B2B service
  productizing monastic blessings, mor phi diagnostics,
  spirit-house audits, and post-clearance aftercare for
  hospitality, condos, and stigmatized real estate.
  Tagged: supernatural, ghosts, bangkok, business.

Both plans propose state-binding or commercial infrastructure
premised on supernatural mechanisms with no empirical basis,
making them clean test cases for the propagating-falsehood
trigger.
…and (B.2) non-physical-causation

The previous propagating-falsehood trigger covered (B.1) only —
plans whose claims directly contradict named physics laws or
empirical facts (the flat-earth case). It missed plans whose
commercial, legal, or operational success requires real-world
outcomes to follow from a non-physical mechanism (supernatural
causation, action-at-a-distance influence on human affairs,
ritual procedures producing measurable change). On the
catalog's two new test cases — Madhya Pradesh OSAA and Phi-Free
spirit-clearance — the model gave defensible LOW verdicts by
reading the plans as 'governance tools' or 'cultural rituals'.

Split (B) into:
- (B.1) directly contradicts a named law of physics or
  empirical fact (unchanged scope).
- (B.2) load-bearing non-physical causation: the plan requires
  real-world outcomes to be produced by a mechanism physics
  does not describe and that has no empirical basis.

Add explicit guidance:
- Cultural framing or widespread tradition does NOT exempt.
- Subjective metrics (client self-report, satisfaction-of-
  absence) do NOT exempt if operationalized as evidence the
  mechanism worked.
- Structural test: would the plan still 'work' if the
  non-physical mechanism is acknowledged to have no causal
  power? If no, HIGH.
- Three concrete operational tests around revenue, success
  metrics, and institutional authority.

Audit pass: confirmed no test-prompt-paraphrasing words in the
system prompt (no 'ghosts/spirits/astrology/muhurta/Earth's
shape/world is flat/etc.'). The rule is principle-only.

Validation:
- New test cases (OSAA, Phi-Free) now correctly flag HIGH with
  on-topic justifications citing 'load-bearing claim under
  (B.2)' and naming the non-physical mechanism.
- Regression on SEED=800 (the original 10 prompts) still
  produces 9 LOW + 1 HIGH (flat-earth correctly flagged); no
  previously-correct verdict was disturbed by the broadening.
- The 'better battery' catalog prompt and other R&D / ambitious-
  engineering cases tested in earlier seeds remain LOW (R&D
  carve-out unaffected).
Adds a third supernatural test case alongside OSAA and Phi-Free.

Nyxa is a Shanghai-headquartered cross-border e-commerce
platform spawning vertical-specific supernatural commerce sites
(UFO/contactee/disclosure, religious-supernatural goods, tarot,
witchcraft/occult) on a US$150M Series A. Distinguishing
features versus the other two test cases:

- The plan layers commercial fraud on top of supernatural
  claims: a 'synthetic credibility manufacturing stack' with
  AI-generated lore presented as authentic, ~250k sockpuppet
  community accounts run against active CIB enforcement, and
  200-400 undisclosed paid micro-influencers per vertical in
  direct tension with FTC/CAP/DSA disclosure requirements.
- The supernatural mechanism is not claimed by the operators
  themselves to work — the plan operationalizes customers'
  belief in the mechanism rather than asserting the mechanism
  is real.

This makes Nyxa a useful borderline case for the (B.2)
non-physical-causation trigger: the plan's revenue depends on
customers paying for goods sold as having supernatural
efficacy, but the operators' own success criteria are
commercial (GMV, sockpuppet ban rate, regulatory survival)
rather than 'the rituals worked'.
…operator belief is irrelevant

Add a single sentence to clause (1): 'Marketing or product
descriptions that present the mechanism as producing real
effects count as assertion; the operators' private belief is
irrelevant.'

This addresses cases where a plan's commercial offering markets
a non-physical mechanism's effects to customers (e.g. an
astrological-advisory office whose published correlation metrics
attribute policy outcomes to planetary alignment; a
spirit-clearance firm whose 'no-recurrence' success metric
attributes commercial outcome to ritual). Both correctly flag
HIGH under (B.2).

Boundary case where the rule still LOWs (defensible by the
operator's-own-success-criteria test): a third supernatural
catalog test case — Nyxa, a deliberately-cynical e-commerce
operation that monetises customer belief in supernatural goods
while the operators' own success criteria are purely commercial
(GMV, sockpuppet ban rate, regulatory survival). The plan is
load-bearing on commercial deception, not on supernatural
causation; that's a misinformation/regulatory-compliance shape
rather than a physics-laws shape, and would be cleaner to catch
in a separate 'Misinformation / Deceptive Pseudoscientific
Marketing' audit item if needed.

Five prompt iterations attempted to extend the rule to catch
Nyxa; each longer version destabilised the Phi-Free case. The
shorter wording above is the stable peak — keeps OSAA and
Phi-Free at HIGH without regressing earlier verdicts.
…arness

The four 'should-trigger' catalog cases — flat-earth education
curriculum, OSAA statutory astrological office, Phi-Free
spirit-clearance firm, Nyxa synthetic-supernatural commerce —
are now grouped in an EXPECTED_HIGH_IDS dict and run on every
smoke invocation regardless of SAMPLE_SEED. Each canary is
labelled and printed under its own '--- Expected HIGH canaries
---' section before the random-sample loop.

The summary block now reports any canary that came back not-HIGH
under '=== Canary failures ===', with id and label, so a
regression on these cases is immediately visible without having
to scan 20+ random sample lines.

Random-sample filter extended to also exclude EXPECTED_HIGH_IDS
so canaries are not double-run.

Current state on the canary set:
- flat-earth: HIGH ✓
- OSAA: HIGH ✓
- Phi-Free: HIGH ✓
- Nyxa: LOW ✗ (known borderline; operators monetise customer
  belief without claiming the mechanism works in their own
  success criteria)
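The canary harness described above can be sketched like this. The dict keys and the check function are placeholders (the real EXPECTED_HIGH_IDS keys are catalog UUIDs, and the real check calls the LLM); only the shape — always-run canaries, a failure block in the summary, and exclusion from the random sample — mirrors the commit.

```python
# Placeholder entries; real keys are catalog UUIDs with fuller labels.
EXPECTED_HIGH_IDS = {
    "id-flat-earth": "flat-earth education curriculum",
    "id-osaa": "OSAA statutory astrological office",
}

def run_canaries(check_fn):
    """Run every canary regardless of SAMPLE_SEED; return any not-HIGH results."""
    print("--- Expected HIGH canaries ---")
    failures = []
    for prompt_id, label in EXPECTED_HIGH_IDS.items():
        level = check_fn(prompt_id)
        print(f"{prompt_id}: {level}  [{label}]")
        if level != "high":
            failures.append((prompt_id, label))
    if failures:
        print("=== Canary failures ===")
        for prompt_id, label in failures:
            print(f"{prompt_id}: {label}")
    return failures

def fresh_sample_ids(catalog_ids, held_out):
    """Random-sample filter: exclude held-out IDs AND canaries (no double-run)."""
    return [i for i in catalog_ids
            if i not in held_out and i not in EXPECTED_HIGH_IDS]
```

A regression on a canary surfaces in the `=== Canary failures ===` block with id and label, without scanning the 20+ random-sample lines.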
…est case

Adds a sex-discrimination test prompt: a hypothetical US
federal program to prohibit American women, by sex, from
owning or using personal computers, smartphones, tablets, and
internet-connected devices, with a two-year rollout, federal
enforcement infrastructure, an exemption framework, and a
stakeholder strategy for organized opposition.

Tags: discrimination, suppression, women, usa, business.
…GH_IDS

The prompt is a sex-based device-access ban with federal
enforcement. On principle it is a discrimination / human-rights
violation, not a physics-laws violation: the plan doesn't
assert a falsehood about the physical world (e.g. 'women cannot
operate computers'), it imposes an unjust restriction. The
current physics check correctly LOWs it.

Add it to EXPECTED_HIGH_IDS anyway — the user wants this to
'fire the alarm', and surfacing the gap as a known-failing
canary makes the scope-mismatch visible. The proper fix is a
separate audit item ('Constitutional / Discrimination' or
'Human Rights Violation') rather than broadening the physics
rule into a general unjust-plan detector (which would
over-flag massively).

Current canary state on the smoke harness:
- flat-earth: HIGH (stable)
- OSAA: HIGH (stable)
- Phi-Free: HIGH most runs, LOW some runs (model non-determinism
  on borderline cases even at temperature 0.0)
- Nyxa: LOW (operators are explicit cynics, principled cut)
- women-ban: LOW (principled — wrong audit-item shape)
…h canary

The ban-women-from-computers prompt is kept in EXPECTED_HIGH_IDS
as a documented scope-mismatch canary. It is a sex-based
suppression of computing/internet access — an unjust premise
and a discrimination/rights problem — but it is NOT a
physics-laws problem in the rule's strict sense (the plan does
not assert physics-incompatible claims; it imposes an unjust
restriction).

This kind of plan is properly attacked by
diagnostics/premise_attack.py, which targets fundamental,
unfixable flaws in a prompt's premise and includes a
rights/dignity/consent critique. The physics check correctly
LOWs it.

Add an explicit comment block warning future-self NOT to
broaden the physics rule to catch this — the broadening would
over-flag a wide class of ideological / political /
discrimination plans and dilute the check. The canary entry
itself documents the routing decision rather than asking the
physics check to do the wrong job.
…LE_SEED=1400

Result on a 30-prompt random sample plus 5 canaries (140 total
distinct catalog prompts evaluated across all seeds run so far):

  =====================
  Total: 32 LOW + 3 HIGH
  =====================

Canary results (3/5 expected-HIGH fired):
  - flat-earth education curriculum: HIGH ✓
  - OSAA statutory astrological office: HIGH ✓
  - Phi-Free spirit-clearance firm: HIGH ✓ (recovered after
    last-run LOW; confirms borderline-noisy)
  - Nyxa synthetic supernatural commerce: LOW (documented
    scope-mismatch — operators are explicit cynics)
  - ban-women-from-computers: LOW (documented scope-mismatch —
    handled by diagnostics/premise_attack.py)

Random sample notable cases (all correctly LOW):
  - Carrington Event Faraday-cage prep
  - Westworld-style robot theme park
  - Taiwan political/cultural realignment
  - Microplastics ocean report
  - Insect-farm pilot

The canary failure block continues to make the documented
edge cases visible without polluting the random-sample
distribution.
The previous LOW-mitigation template — 'Project Manager: During
scope reviews, confirm no plan element requires violating a
named law of physics — no further action required' — was
busywork dressed up as an action. Forcing the role + verb +
timeframe shape onto LOW (where there's nothing to mitigate)
produced fake scope reviews on plans that have no physics
exposure at all (linguistics standardization, e-commerce
platforms, social policy plans).

Replace the LOW-mitigation guidance: when level is low, the
mitigation should briefly acknowledge non-applicability in the
form 'No physics-related action required — the plan does not
invoke physics-incompatible mechanisms.' Explicitly tell the
model NOT to invent scope reviews, confirmation steps, audits,
or other busywork tasks just to satisfy the assignable-task
shape. The role + verb + relative-timeframe shape still applies
for medium/high (where there IS a real action to schedule).

Validation against canaries + Clear English:
- HIGH cases (OSAA, Phi-Free, flat-earth): produce real
  actions with role, verb, and relative timeframe (RCT design,
  disclaimer drafting, peer-reviewed report).
- LOW cases (Nyxa, women-tech-ban, Clear English): produce
  the honest 'No physics-related action required' form.

The output is now meaningful in both directions: HIGH lands an
actionable mitigation; LOW lands a clear non-applicability
acknowledgement. No more busywork.
…and scope-mismatch context

The canary labels were too terse — bare names like 'Nyxa —
synthetic supernatural commerce' didn't tell the reader why the
case was in this set or what trigger it was supposed to fire
under. Two improvements:

1. Per-entry comment block above each line explaining what the
   case is, which trigger ((B.1)/(B.2)) it should fire under,
   or — for the documented scope-mismatch entries — why the
   physics check correctly LOWs it and where it is routed
   instead.

2. Runtime labels updated for the two scope-mismatch entries to
   make the situation visible in the canary-failure block:
   - Nyxa: '... (known scope-mismatch; deception, not
     load-bearing supernatural causation)'
   - ban-women-from-computers: '... (known scope-mismatch;
     discrimination/rights problem, routed to
     premise_attack.py)'

A reader scanning the canary-failure output now immediately sees
why these failures are documented design decisions rather than
real regressions, and the source code per-entry comments give
the full reasoning. The previously-fragile cases (flat-earth,
OSAA, Phi-Free) get matching trigger annotations so the
expected-HIGH cases also explain themselves.
…ional artifacts

A real production run of the audit on the expanded Clear
English plan returned HIGH with this self-revealing
justification: 'The plan ... does not rely on violating named
laws of physics. However ... this falls under (B.2) as a
dependency on a non-physical mechanism (a specific engineering
artifact/parser) ...'

The model misread (B.2). It treated an *abstract engineering
deliverable* — a software parser required by the plan's
adoption strategy — as 'non-physical causation', because the
parser is not a tangible object. This is the wrong reading: a
parser is software running on hardware, both of which are
physical and described by physics. (B.2) was meant for
mechanisms physics genuinely does not describe (supernatural
agency, action-at-a-distance via celestial body alignment,
ritual procedures producing measurable physical effects), not
for any abstract critical engineering item.

Add a clarifying sentence to (B.2) inside the system prompt:
software, parsers, contracts, curricula, financial flows,
supply chains, organisational processes, regulatory frameworks,
and other engineering or institutional artifacts ARE physical
mechanisms in the relevant sense — they exist as information
running on hardware or as documented agreements implemented
through human action, all of which is described by physics.
'Load-bearing engineering deliverable required for plan
success' is explicitly NOT (B.2).

The smoke harness can only test the bare initial prompt; the
failure surfaced on the full expanded plan (with premortem
failure modes, decisions, budget items). The clarification
targets the specific misreading the model demonstrated. On
the bare Clear English prompt the model now describes it as
'a legitimate R&D effort in language design and education',
which is the correct framing.

System-prompt audit pass: confirmed via grep that no test-
prompt-paraphrasing words were introduced (no grapheme,
phoneme, ordinal, FM2, parser-specific phrasings). The
clarification is principle-only.
…hemselves

A real production run on the expanded Clear English plan (196KB
of strategic decisions, levers, premortem, budget) returned
HIGH with the model citing 'perceived cognitive load' (Decision
2 success criterion) and 'naturalness score' (Decision 8
Go/No-Go input) as 'non-physical mechanism' — pattern-matched
against the prompt's earlier 'self-reported absence of
phenomena' phrasing in the operational tests. That phrase was
a Phi-Free-specific hook that over-generalised to every
subjective metric.

Two fixes:
1. Drop the leak-prone 'self-reported absence of phenomena'
   wording from the operational tests for (B.2). Phi-Free's
   metric is still caught by the general (B.2) trigger because
   the *mechanism behind* the metric is supernatural; the
   metric itself doesn't need to be enumerated.
2. Add a clarifying paragraph: subjective and self-reported
   metrics ARE physical phenomena (measurements of human
   nervous-system responses, which are matter and energy
   following physical laws). A subjective metric is not a (B.2)
   trigger by itself — (B.2) requires that the *causal
   mechanism behind* the rated outcome be one physics does not
   describe. The metric is not the trigger; the mechanism is.

Audit pass: confirmed no test-prompt-paraphrasing words in the
SYSTEM_PROMPT (no naturalness, perceived cognitive load,
grapheme, phoneme, FM2, paranormal, ritual cleansing, etc.).
Phrasing is principle-only.

Validation:
- Bare Clear English prompt: LOW (unchanged).
- FULL Clear English plan (196KB): LOW (was HIGH on prior rule;
  fix lands).
- Canary set this run: 3/5 stable HIGH (flat-earth, OSAA,
  Phi-Free), 2/5 documented scope-mismatches LOW (Nyxa,
  women-tech-ban) — no regressions.

Note: in the full-plan LOW result the model reused the LOW-
mitigation template in BOTH the justification and mitigation
fields, instead of producing a substantive reasoning sentence
in justification. That is a separate model-output-quality issue
worth investigating; the verdict itself is now correct.
Adds a misinformation/deception test case: a non-denominational
evangelical center in Nashville packaging proprietary herbal-
extract drops marketed to congregants as an epilepsy treatment,
distributed through ~800 partner churches across five states,
with net proceeds funding both evangelistic expansion and
substantial personal compensation packages for the founding
pastoral leadership (parsonage allowances, related-entity
contracting, etc.) while preserving 501(c)(3) status.

Tags: misinformation, deception, religion, usa, business.
…into (B.2)

A second production run on the expanded Clear English plan
(199KB) returned HIGH again. The model's reasoning explicitly
acknowledged the issue is engineering, not physics: 'though it
is a borderline case based on empirical engineering feasibility
rather than established natural law. High rating is preferred
as the feasibility claim is central to managing a defined
risk.' The model was conflating 'central to risk register' with
'physics violation' and reaching for HIGH on plan-internal
risk-management vocabulary.

Three principled additions to defuse this:

1. Add an explicit out-of-scope item for engineering /
   computational feasibility uncertainty under cost, time, or
   scope constraints — explicitly noting it stays 'low' no
   matter how central the deliverable is to the plan.

2. Defuse the 'load-bearing' vocabulary collision: plans
   routinely use 'load-bearing', 'critical', 'central to risk',
   'non-negotiable', 'Decision N', 'Risk N', 'Failure mode N'
   to describe engineering / governance / strategic deliverables.
   That vocabulary in the plan does NOT match the rule's
   '(B.2) load-bearing non-physical mechanism'. Calling
   something 'load-bearing' or 'critical' in a plan does not
   make it (B.2).

3. Add a stop-sign for the model's own escalation reasoning:
   if the rationale for HIGH includes 'borderline case', 'leans
   toward', 'high rating is preferred', 'central to risk
   register', or 'engineering feasibility uncertainty', the
   rating MUST be 'low'. HIGH requires concrete, named physics-
   incompatibility, not generic plan-importance.

Audit: SYSTEM_PROMPT contains no test-prompt-paraphrasing words
(no naturalness, perceived cognitive load, grapheme, NLP model,
homograph, semantic tagging, FM2, etc.). The 'Decision N' /
'Risk N' / 'Failure mode N' phrasings describe a generic
vocabulary class used by most plans, not test-prompt-specific
values.

Validation:
- 199KB Clear English plan that previously returned HIGH: now
  returns LOW.
- Canary set unchanged: OSAA HIGH, Phi-Free HIGH, flat-earth
  HIGH, Nyxa LOW (scope-mismatch), women-tech-ban LOW
  (scope-mismatch), Clear English bare LOW.

Known remaining quality issue (separate bug): on full-plan LOW
results, the model fills the justification field with the LOW-
mitigation template instead of producing actual reasoning.
Verdict is correct, output text is degraded.
…plan

The 'Violates Known Physics' check has been misfiring HIGH on
real production runs (Clear English plan rated HIGH twice in
a row). Each prompt-tuning round closed one keyword pattern,
the next expansion of the plan found a new one, and the model
kept reaching for HIGH on engineering-feasibility uncertainty,
risk-register vocabulary, and subjective metrics. Three
clarifying paragraphs in the system prompt are not robustness;
they are a leaky bucket, and the prompt was getting brittle.

Architectural fix instead: the physics check doesn't need the
200KB+ expanded plan. The bare initial user prompt is
sufficient to answer 'does this plan require breaking
physics?', and the smoke harness has validated this against
~140 distinct catalog prompts with stable verdicts. The
expanded plan's premortem failure modes, decision matrices,
risk register, and budget breakdowns add hundreds of pattern-
match opportunities that mislead small/medium models without
adding any physics-relevant signal.

Three changes:

1. SelfAuditTask now requires SetupTask, reads the bare initial
   prompt, and passes it to SelfAudit.execute via a new
   physics_user_prompt parameter.

2. SelfAudit.execute accepts an optional physics_user_prompt
   and routes it to ViolatesKnownPhysics.execute. Falls back to
   user_prompt when not provided (backward-compatible for any
   other caller). The other 16 batch checklist items continue
   to see the full expanded plan as before.

3. Revert the recent prompt-vocabulary defusing additions in
   the physics system prompt: 'Plans routinely use "load-
   bearing", "Decision N", ...', 'If your reasoning
   includes "borderline case", ...', and the engineering-
   feasibility out-of-scope item. They were band-aiding a
   problem that doesn't exist when the physics check sees the
   bare prompt instead of the expanded plan, and they were
   making the prompt longer (which itself destabilises the
   model).

Validation:
- Canary set: OSAA HIGH, flat-earth HIGH, Nyxa/women-tech-ban
  scope-mismatch LOW, Clear English LOW. Phi-Free flickered to
  LOW this run (the documented model-non-determinism on this
  borderline case; separate from this fix).
- Catalog regression across 140 prompts: unchanged from prior
  runs.
- The 199KB Clear English plan that previously returned HIGH
  is now never reached by the physics check — it sees the bare
  initial prompt instead, which has been stable LOW across
  every smoke run.
…licate mitigation template

After moving the physics check to the bare initial prompt, the
verdict was reliably correct, but both the justification and
mitigation fields were being filled with the LOW-mitigation
template ('No physics-related action required — the plan does
not invoke physics-incompatible mechanisms'). Two reasons: (1) the
guidance for LOW justification told the model to 'state plainly
that the plan does not require breaking a named law of physics
and does not depend on a physics-incompatible claim as a load-
bearing mechanism' — wording very close to the mitigation
template, so the model collapsed them; (2) no instruction to
distinguish the two fields.

Tighten the LOW-justification guidance: characterize what kind
of plan this actually is (its general category — construction,
software development, regulatory program, research study,
social policy, curriculum design) and explain why physics is
not at issue. Add an explicit instruction that the
justification must NOT be the same wording as the mitigation.

Validation on the canary set: justifications now substantively
identify the plan's nature ('cross-border e-commerce business
model', 'sociopolitical policy proposal', 'linguistic and
educational initiative') while mitigations remain the standard
'No physics-related action required' template. Verdicts
unchanged across the full canary set.
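The field separation this commit enforces can be sketched as a response-schema shape. This is an illustration only, using the PhysicsCheck field order described in the PR summary (justification before level, so the model commits to reasoning before its verdict); the real schema lives in violates_known_physics.py.

```python
from dataclasses import dataclass

@dataclass
class PhysicsCheck:
    # Field order matters for structured LLM output: reasoning first,
    # verdict last, so the model cannot commit to HIGH and then argue LOW.
    justification: str  # what kind of plan this is, why physics is(n't) at issue
    mitigation: str     # real action (medium/high) or non-applicability note (low)
    level: str          # 'low' | 'medium' | 'high'

check = PhysicsCheck(
    justification="Linguistic and educational initiative; physics is not at issue.",
    mitigation=("No physics-related action required — the plan does not "
                "invoke physics-incompatible mechanisms."),
    level="low",
)
```

The explicit instruction that justification must not repeat the mitigation wording is what keeps the two fields from collapsing into one template.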
…enarios

Append two guidance lines to the Clear English prompt:
- 'Optimize for user adoption. Don't optimize for linguistic
  purity.' — steers the planner toward pragmatic standardization
  rather than maximalist redesign.
- 'Don't pick the most aggressive scenario.' — same nudge other
  red-team-flavoured prompts already use to keep the planner
  off the most extreme path.
@neoneye neoneye merged commit e444666 into main May 3, 2026
3 checks passed
@neoneye neoneye deleted the fix/self-audit-physics-example-bleedthrough branch May 3, 2026 12:43