Fix U2 background subtype-clobber (silent species-correction disable) #18
Open
glarue wants to merge 1 commit into
…type
The species U2 background build mapped all non-canonical splice dinucleotides
to 'gtag' (via the FIVE/THREE_DNT_TO_SUBTYPE defaults) and assigned the corrected
matrix into matrices[('u2', pwm_subtype)] in sorted-dnt order with last-writer-wins
(added in c406498 for streaming/in-memory determinism). Because T-starting 5' dnts
and most non-AG 3' dnts sort AFTER the canonical GT/AG, a handful of spurious
non-canonical introns (e.g. 3x 'TT') would win the slot. Their blend weight
w = n/(n+n0) is ~0 for low n, so the corrected background collapsed to the human
U2 prior — silently disabling the species correction on any annotation carrying
non-canonical dnts.
Impact is conditional: harmless on near-human-composition species (reverting to
human is a no-op; gold validation passed, which is why it went unnoticed), but it
inflates raw scores on compositionally-divergent species (AT-rich protists, etc.),
producing high-confidence false positives (observed: a confirmed U12-loss ciliate
went 1 -> 95 HC U12 on an annotation that happened to carry a few non-canonical
introns).
Fix: iterate subtypes by (-n, dnt) and first-writer-wins, so the highest-n
(canonical) dnt defines each pwm_subtype background and rare low-n non-canonical
dnts cannot clobber it. The sort is order-independent, preserving the
streaming/in-memory determinism c406498 sought. Applied to both the 5'/3' and BPS
loops. Adds a regression test (fails pre-fix: background reverts to the human
prior; passes post-fix).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Fixes a silent bug in the species U2 background correction that disables the
correction on annotations carrying non-canonical splice dinucleotides, inflating
scores on compositionally-divergent species and producing high-confidence U12
false positives.
Root cause
SpeciesBackground._build_final_pwm_sets maps all non-canonical splice dnts to
'gtag' (via the FIVE/THREE_DNT_TO_SUBTYPE defaults) and assigns the corrected
matrix into matrices[('u2', pwm_subtype)] in sorted-dnt order with
last-writer-wins (added in c406498 for streaming/in-memory determinism). Because
T-starting 5′ dnts and most non-AG 3′ dnts sort after the canonical GT/AG, a
handful of spurious non-canonical introns (e.g. 3× TT) win the gtag slot. Their
blend weight w = n/(n+n0) is ~0 at low n, so the corrected background collapses
to the human U2 prior, silently reverting the species correction.
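The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the project's code: scalar values stand in for the real PWM matrices, and `DNT_TO_SUBTYPE`, `blend`, `build_last_writer_wins`, the counts, `n0 = 1000`, and the 0.40/0.25 values are all hypothetical, chosen only to make the ordering effect visible.

```python
# Illustrative sketch only: scalars replace the real PWM matrices, and every
# name and number below is hypothetical rather than taken from the code base.
DNT_TO_SUBTYPE = {"GT": "gtag"}  # assumed mapping; unknown dnts default to 'gtag'


def blend(species_value, prior_value, n, n0):
    """Shrink the species estimate toward the prior with w = n / (n + n0)."""
    w = n / (n + n0)
    return w * species_value + (1.0 - w) * prior_value


def build_last_writer_wins(counts, species_value, prior_value, n0):
    """Pre-fix behaviour: visit dnts in sorted order; later writes overwrite."""
    matrices = {}
    for dnt in sorted(counts):
        subtype = DNT_TO_SUBTYPE.get(dnt, "gtag")
        matrices[("u2", subtype)] = blend(
            species_value, prior_value, counts[dnt], n0
        )
    return matrices


# Many canonical GT introns plus three spurious TT introns.
counts = {"GT": 50_000, "TT": 3}
background = build_last_writer_wins(
    counts, species_value=0.40, prior_value=0.25, n0=1_000
)
gtag = background[("u2", "gtag")]

# 'TT' sorts after 'GT', so its near-zero blend weight defines the slot and
# the stored value sits next to the prior instead of the species estimate.
print(f"gtag background: {gtag:.4f}")
```

With these made-up inputs, the GT entry alone would blend to a value close to the species estimate, but the TT entry overwrites it with a value close to the prior.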
Impact
Conditional: harmless on near-human-composition species (revert-to-human is a no-op
— which is why gold validation passed and this went unnoticed), but it inflates raw
scores on compositionally-divergent species (AT-rich protists, etc.). Observed: a
confirmed U12-loss ciliate went 1 → 95 HC U12 on an annotation that happened to
carry a few non-canonical introns.
Fix
Iterate subtypes by (-n, dnt) and first-writer-wins, so the highest-n
(canonical) dnt defines each pwm_subtype and rare low-n non-canonical dnts
cannot clobber it. The sort is order-independent, preserving the
streaming/in-memory determinism c406498 sought. Applied to the 5′/3′ and BPS
loops.
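The ordering change can be sketched as follows. As before, this is an illustration under stated assumptions rather than the patched method itself: scalars stand in for matrices, and `DNT_TO_SUBTYPE`, `blend`, `build_first_writer_wins`, and all numbers are hypothetical.

```python
# Illustrative sketch only: scalars replace the real PWM matrices, and every
# name and number below is hypothetical rather than taken from the code base.
DNT_TO_SUBTYPE = {"GT": "gtag"}  # assumed mapping; unknown dnts default to 'gtag'


def blend(species_value, prior_value, n, n0):
    """Shrink the species estimate toward the prior with w = n / (n + n0)."""
    w = n / (n + n0)
    return w * species_value + (1.0 - w) * prior_value


def build_first_writer_wins(counts, species_value, prior_value, n0):
    """Visit dnts by (-n, dnt); the first dnt to claim a slot keeps it."""
    matrices = {}
    for dnt in sorted(counts, key=lambda d: (-counts[d], d)):
        key = ("u2", DNT_TO_SUBTYPE.get(dnt, "gtag"))
        if key in matrices:
            continue  # a higher-n dnt already defined this subtype
        matrices[key] = blend(species_value, prior_value, counts[dnt], n0)
    return matrices


# The same counts supplied in two different insertion orders.
forward = build_first_writer_wins({"GT": 50_000, "TT": 3}, 0.40, 0.25, 1_000)
reverse = build_first_writer_wins({"TT": 3, "GT": 50_000}, 0.40, 0.25, 1_000)

print(forward == reverse)  # the sort key makes the result order-independent
print(f"gtag background: {forward[('u2', 'gtag')]:.4f}")
```

Sorting on (-n, dnt) gives a total order that depends only on the counts and the dinucleotide strings, so the result does not depend on how the input was assembled; the dnt tiebreak covers the case where two dnts share the same n.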
Verification
test_low_n_noncanonical_dnt_does_not_clobber): fails pre-fix (background reverts to the human prior), passes post-fix.
species is unchanged (1 → 1); the dumped gtag background goes 0.0006 → 0.136 (≈ the correctly-corrected value).
Validation
test_low_n_noncanonical_dnt_does_not_clobber): fails pre-fix (background reverts to the human prior), passes post-fix.
test_modesep_pipeline.py::test_continuous_discount_preserves_strong_TPs) fails, but it is pre-existing on main (fails on unfixed main too) and is unrelated to this change. (Not present on the v2.4.x line.)