Fix U2 background subtype-clobber (silent species-correction disable) #18

Open
glarue wants to merge 1 commit into main from fix/u2-background-clobber

glarue (Owner) commented Jun 1, 2026

Summary

Fixes a silent bug in the species U2 background correction that disables the
correction on annotations carrying non-canonical splice dinucleotides, inflating
scores on compositionally-divergent species and producing high-confidence U12
false positives.

Root cause

SpeciesBackground._build_final_pwm_sets mapped all non-canonical splice dnts to
'gtag' (via the FIVE/THREE_DNT_TO_SUBTYPE defaults) and assigned each corrected
matrix into matrices[('u2', pwm_subtype)] in sorted-dnt order with last-writer-wins
semantics (added in c406498 for streaming/in-memory determinism). Because
T-starting 5′ dnts and most non-AG 3′ dnts sort after the canonical GT/AG, a
handful of spurious non-canonical introns (e.g. 3× TT) win the gtag slot. Their
blend weight w = n/(n+n0) is ~0 at low n, so the corrected background collapses to
the human U2 prior, silently reverting the species correction.
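
To make the collapse concrete, here is a minimal sketch of the shrinkage weight quoted above. Only the formula w = n/(n+n0) comes from this PR; the linear blend form, the value of n0, and the example frequencies are assumptions chosen for illustration and are not taken from the codebase.

```python
def blend_weight(n: int, n0: float) -> float:
    """Shrinkage weight w = n / (n + n0), as quoted in the root-cause text."""
    return n / (n + n0)


def blend(species_freq: float, prior_freq: float, n: int, n0: float) -> float:
    """Assumed linear blend: w * species + (1 - w) * prior."""
    w = blend_weight(n, n0)
    return w * species_freq + (1.0 - w) * prior_freq


# All three constants are hypothetical; the real n0 is not stated in the PR.
N0 = 1000.0
PRIOR_FREQ = 0.20    # human-prior frequency for one base at one position
SPECIES_FREQ = 0.40  # species frequency for the same base and position

# A 3-intron non-canonical dnt versus a 50,000-intron canonical dnt.
low_n = blend(SPECIES_FREQ, PRIOR_FREQ, n=3, n0=N0)
high_n = blend(SPECIES_FREQ, PRIOR_FREQ, n=50_000, n0=N0)

print(f"n=3      w={blend_weight(3, N0):.4f}  blended={low_n:.4f}")
print(f"n=50000  w={blend_weight(50_000, N0):.4f}  blended={high_n:.4f}")
```

Under these assumptions, the low-n blend sits almost exactly on the prior while the high-n blend sits close to the species value. Letting the low-n dnt own the gtag slot is therefore equivalent to switching the correction off.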

Impact

The impact is conditional. On near-human-composition species the bug is harmless,
because reverting to the human prior is a no-op (which is why gold validation passed
and this went unnoticed). On compositionally-divergent species (AT-rich protists,
etc.) it inflates raw scores. Observed: a confirmed U12-loss ciliate went
1 → 95 HC U12 on an annotation that happened to carry a few non-canonical introns.

Fix

Iterate subtypes sorted by (-n, dnt) and assign with first-writer-wins, so the
highest-n (canonical) dnt defines each pwm_subtype background and rare low-n
non-canonical dnts cannot clobber it. The sort key depends only on the counts and
dnt strings, not on input order, so it preserves the streaming/in-memory determinism
c406498 sought. Applied to both the 5′/3′ and BPS loops.
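
The ordering change can be sketched in isolation as follows. This is not the code from SpeciesBackground._build_final_pwm_sets: the SUBTYPE mapping, the counts, and both function names are hypothetical stand-ins, and the slots hold the winning dnt rather than a matrix so the result is easy to inspect.

```python
import random

# Hypothetical dnt -> pwm_subtype mapping; the non-canonical 'TT' falls
# back to 'gtag', mirroring the default described in the root cause.
SUBTYPE = {"GT": "gtag", "TT": "gtag", "GC": "gcag"}


def assign_last_writer_wins(counts: dict[str, int]) -> dict[str, str]:
    """Pre-fix behaviour: plain sorted-dnt order, later dnts overwrite."""
    slots: dict[str, str] = {}
    for dnt in sorted(counts):
        slots[SUBTYPE[dnt]] = dnt
    return slots


def assign_first_writer_wins(counts: dict[str, int]) -> dict[str, str]:
    """Post-fix behaviour: highest-n first, ties broken by dnt; first wins."""
    slots: dict[str, str] = {}
    for dnt in sorted(counts, key=lambda d: (-counts[d], d)):
        slots.setdefault(SUBTYPE[dnt], dnt)  # no-op if the slot is taken
    return slots


counts = {"GT": 50_000, "TT": 3, "GC": 400}  # hypothetical intron counts
buggy = assign_last_writer_wins(counts)
fixed = assign_first_writer_wins(counts)

# Shuffling the input order must not change the fixed result.
items = list(counts.items())
random.shuffle(items)
fixed_shuffled = assign_first_writer_wins(dict(items))

print("pre-fix :", buggy)
print("post-fix:", fixed)
```

In this sketch the pre-fix loop hands the gtag slot to TT (n = 3) because "TT" sorts after "GT", while the post-fix loop keeps GT. The (-n, dnt) key is a total order over the entries, so the result is the same however the counts were accumulated.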

Verification

  • New regression test (test_low_n_noncanonical_dnt_does_not_clobber): fails
    pre-fix (background reverts to the human prior), passes post-fix.
  • Full unit suite green (657 passed).
  • End-to-end: the loss ciliate goes 95 → 3 HC; a clean-annotation run of the same
    species is unchanged (1 → 1); the dumped gtag background goes
    0.0006 → 0.136 (≈ the correctly-corrected value).

Validation

  • Regression test (test_low_n_noncanonical_dnt_does_not_clobber): fails pre-fix (background reverts to the human prior), passes post-fix.
  • Unit suite: green (657 passed on main-line; 12 on v2.4.x).
  • Gold recall preserved (buggy→fixed HC): A. thaliana 291→288, D. melanogaster 20→19, Chlamydomonas (loss) 3→3 — negligible; the fix does not disturb validated species.
  • Divergent bearers preserved: gut fungi (Neocallimastix/Piromyces/Anaeromyces) and Basidiobolus HC change ≤10 — real U12 calls retained; core_fraction stays high (e.g. 0.97→0.77) so they remain trusted.
  • ⚠️ One integration test (test_modesep_pipeline.py::test_continuous_discount_preserves_strong_TPs) fails, but the failure is pre-existing (it also fails on unfixed main) and is unrelated to this change. (The test is not present on the v2.4.x line.)

…type

The species U2 background build mapped all non-canonical splice dinucleotides
to 'gtag' (via the FIVE/THREE_DNT_TO_SUBTYPE defaults) and assigned the corrected
matrix into matrices[('u2', pwm_subtype)] in sorted-dnt order with last-writer-wins
(added in c406498 for streaming/in-memory determinism). Because T-starting 5' dnts
and most non-AG 3' dnts sort AFTER the canonical GT/AG, a handful of spurious
non-canonical introns (e.g. 3x 'TT') would win the slot. Their blend weight
w = n/(n+n0) is ~0 for low n, so the corrected background collapsed to the human
U2 prior — silently disabling the species correction on any annotation carrying
non-canonical dnts.

Impact is conditional: harmless on near-human-composition species (reverting to
human is a no-op; gold validation passed, which is why it went unnoticed), but it
inflates raw scores on compositionally-divergent species (AT-rich protists, etc.),
producing high-confidence false positives (observed: a confirmed U12-loss ciliate
went 1 -> 95 HC U12 on an annotation that happened to carry a few non-canonical
introns).

Fix: iterate subtypes by (-n, dnt) and first-writer-wins, so the highest-n
(canonical) dnt defines each pwm_subtype background and rare low-n non-canonical
dnts cannot clobber it. The sort is order-independent, preserving the
streaming/in-memory determinism c406498 sought. Applied to both the 5'/3' and BPS
loops. Adds a regression test (fails pre-fix: background reverts to the human
prior; passes post-fix).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>