
History / Technical algorithm

Revisions

  • Document .modesep.json sidecar + quality tiers in Technical-algorithm
    Cover the per-species mode-separation diagnostic output and the A/B/C/F quality-tier rubric in the wiki, so the wiki is a complete reference for the two-pass classifier (previously only fully documented in the repo's docs/mode_separation.md, which is gitignored). Also describe the boundary_mass diagnostic.
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 20, 2026
  • Update wiki for v2.7 (two-pass mode-separation + continuous discount)
    Reframe the documentation around the v2.6+ two-pass architecture and the v2.7 continuous per-intron discount:
    - Technical-algorithm.md: rewrite Pipeline Overview from 5 stages to 7 (adaptive normalizer fit, first-pass classification, mode estimation + gate, mode-separation second pass, continuous discount). Add new "Why two passes?" rationale, first-pass / mode estimation / gate / second-pass subsections, and a v2.7 continuous-discount section under Score Adjustment. Reframe the legacy Bayesian valley-depth adjustment as the gate-fail-only path. Update training-the-default-model section for the v4_aug + v5_modesep_aug bundle (including HP optimality verification).
    - Overview.md, Home.md, About.md: drop "five-stage" / single-ensemble framing; describe two-pass + continuous discount.
    - Output-files.md: reframe adjusted_score as the v2.7 calling column; add v2.6 (first_pass_svm, modesep_route) and v2.7 (raw_sum, svm_vs_naive, voting_frac) columns.
    - Quick-start.md, Example-usage.md: refresh benchmark to v2.7 (HomSap ~40 min / ~5.3 GB at -p 5, RefSeq GFF, 257k scored introns; DroMel ~8 min / ~0.8 GB). In-memory not re-measured for v2.7.
    - Usage-info.md: threshold default 95 -> 90 (correcting stale line).
    - Training-data-and-PWMs.md: update preamble for the v2.6+ two-pass bundle (v4_aug + v5_modesep_aug; 502K-row second-pass corpus).
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 20, 2026
  • Update normalizer docs for v2.4.2
    Three pages mentioned scaler / --load-normalizer / --normalizer-mode behavior that drifted between v2.4.0 and v2.4.2:
    - Technical-algorithm: rewrote Normalization Modes to describe the bundled multispecies fallback scaler (new in v2.4.2), the auto/adaptive small-input fall-through at MIN_ADAPTIVE_INTRONS=200, and --load-normalizer's role across streaming + in-memory paths. Added a forward reference from the v3 default-model section.
    - Example-usage: split "Custom normalization" into two cases: reproducible saved-scaler workflow (unchanged) and forcing the bundled multispecies scaler via --normalizer-mode human (new guidance for U12-absent / outlier genomes).
    - Usage-info: refreshed the --load-normalizer help block to match the actual contract (works in both modes, overrides bundle scaler).

    @glarue glarue committed May 10, 2026
  • Document Fisher's discriminant valley projection (3D) in Technical-algorithm
    Updates the Stage 5 / valley-depth section to reflect the change from naive 2D centroid direction to Fisher's discriminant in 3D (5'z, BPz, 3'z). Notes why adding 3'z is a win under Fisher's reweighting (it isn't under the naive direction).
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 10, 2026
  • Exclude duplicates from "Why Normalize?" empirical stats
    Re-runs the human raw-score range table on 257,123 deduplicated introns (excluded ~4,299 [d]-tagged duplicates that pre-v2.4 score_info contained; rest of the omitted introns have NA raws and were already excluded). Numbers move by at most 0.1 (duplicates are ~1.7% of the dataset), but the deduplicated count is the methodologically correct one.
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 10, 2026
  • Ground "Why Normalize?" example ranges in real human data
    Replaces the ballpark per-region score ranges with empirical numbers from scoring all 261,422 human (GRCh38 + Ensembl 104) introns post background correction. The 5'SS in particular was understated in the old text: it spans far more than "-50 to +10" and is heavily negative-biased (median ~-41) because most introns are U2-type. Clarifies the rationale for normalization (5'SS would dominate the kernel; all regions land on comparable scales after RobustScaler).
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 10, 2026
  • Align v2.4 / v2.4.1 wiki claims with the actual training corpus
    - Correct "97 species" → 90 training species + 7 evaluation-only holdouts (5 recall + 2 protist) across About, Technical-algorithm, and Training-data-and-PWMs
    - Update the AT-AA paragraph to reflect the two-stage screen result (moderate 5'SS / 3'SS discrimination, BPS at noise floor; recall impact is academic so PWM addition is deferred until ≥1 other nc subtype passes the same screen)
    - Note the v2.4.1 bundling of the multispecies training set in Technical-algorithm
    - Reflect 126-model default in algorithm overview
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 10, 2026
  • Wiki: align training-data, classifier, and resource-usage pages with v2.4
    Across pages: replace v2.3-era "human-only training" framing with the v3 multispecies default; replace stale "~85% memory savings" / "~2 GB vs ~12 GB" memory claims with the measured v2.4 figures (~5.4 GB streaming vs ~10.1 GB in-memory on full human at -p 6); call out bit-identical streaming/in-memory equivalence.
    Per page:
    - Technical-algorithm: reframe Species-Specific Background Correction for v2.4: its primary role shifts from "fix human-only model bias" to "inference-time robustness layer for out-of-distribution species," with a note that the v3 corpus was scored with BG on so disabling at inference creates a train/inference distribution mismatch. Update Normalization Modes table to reflect that the v3 default has no saved scaler and "auto" therefore falls through to "adaptive."
    - Training-data-and-PWMs: clarify that the v3 multispecies training corpus is not bundled (only the trained model is), and that intronIC train still loads the v2.3 reference sets by default.
    - About: drop the inaccurate "linear SVM" / "Platt scaling" wording; v2.3+ uses an RBF SVM with isotonic calibration as the default (Platt as cross-validated fallback).
    - Home, Overview, Example-usage, Usage-info: update memory and runtime numbers to match v2.4 reference benchmarks.

    @glarue glarue committed May 10, 2026
  • Wiki: align resource usage with v2.4 streaming/in-memory equivalence
    - Replace stale Streaming/Standard-mode memory and runtime estimates (the old "~2-3 GB peak / ~6-10 min" line predates the v2.4 multispecies default model and the per-contig parallel pipeline).
    - Add a reference benchmark table on full human GRCh38 + Ensembl 104 (~227k introns, -p 6): streaming ~16 min / 5.4 GB peak, in-memory ~15 min / 10.1 GB peak.
    - Call out bit-identical equivalence between --streaming (default) and --in-memory; mode choice is now purely a runtime/memory tradeoff.
    - Note that --sequences and --bed input modes feed the in-memory path.

    @glarue glarue committed May 10, 2026
  • Wiki: align with v2.4 (multispecies default, threshold 90, streaming + v3)
    - Home.md / Overview.md: update version to 2.4, model from 42- to 126-model multispecies ensemble, threshold from 95 to 90.
    - Quick-start.md / Example-usage.md / Output-files.md / Training-data-and-PWMs.md: same numerical updates.
    - Technical-algorithm.md: rewrite "Training the Default Model" to describe the v3 multispecies corpus (41,333 introns, 97 species, 14 clades, F1 = 1.000 vs v2.3 0.9975, ~330k scored introns FPR comparison). Document that streaming-classify supports both v2.3 and v3 bundles via the new per-contig adaptive-fit pre-pass. Fix the valley-depth math formula KaTeX rendering.
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 9, 2026
  • Standardize U12/U2 terminology to U12-type/U2-type throughout
    All public-facing references to intron types, PWMs, reference sets, and scoring concepts now use the formal "U12-type"/"U2-type" suffix. Internal format strings (motif columns, LaTeX formulas) and CLI help text unchanged.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 24, 2026
  • Update wiki for v2.3.0: 6D features, 42-model ensemble, score adjustment
    - Home: version 2.3, updated feature list
    - Overview: 6D/42-model pipeline, 95% threshold, score adjustment stage
    - Technical-algorithm: 6D feature space, BG correction section, score adjustment section with formula/config, updated hyperparams and ensemble
    - Output-files: 32 columns (added adjusted_score, ensemble_sigma), adjusted score subsection, rel_score = adjusted_score - threshold
    - Training-data: version labels to v2.3.0
    - Usage-info: threshold default 95
    - Example-usage: config/log examples updated
    - Quick-start: threshold 95%, adjusted probability language
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 24, 2026
  • Technical details: add confident U12-type counts to valley depth examples
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
  • Technical details: document species-level cluster validation (valley depth)
    Describe the multi-bandwidth density valley detection algorithm, the valley depth metric, interpretation guidelines with example species values, and the warning message for no-valley cases.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
  • Technical details: add reference column to scoring regions table
    Clarify that 5'SS coordinates are relative to the intron 5' end while BPS and 3'SS coordinates are relative to the intron 3' end.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
  • Technical details: note default settings work well for U12-absent species
    The v2.2 model produces zero confident FPs in C. elegans with default settings, so the prior adjustment is unlikely to be needed by most users.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
  • Technical details: improve parallelization wording
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
  • Fix WtMTA citation: add Larue & Roy 2023, keep Moyer et al. 2020
    The WtMTA database paper is Larue & Roy 2023 (NAR 51:10884-10908), separate from the original intronIC paper (Moyer et al. 2020).
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
  • Italicize species names, add Burke 2018 and WtMTA citations
    - Italicize C. elegans and Ascaris suum in Technical-algorithm.md
    - Add Burke et al. 2018 (spliceosome profiling) to branch point references
    - Add Moyer et al. 2020 (WtMTA/intronIC) to U12-type intron databases
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
  • Technical details: remove specific bp_scan_confidence numbers
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
  • Technical details: clarify bp_scan_confidence values are from training data
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
  • Technical details: remove unsourced specific values for bp_scan_confidence
    Replace training-data-specific numbers with qualitative description.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
  • Technical details: cite Pineda & Bradley 2018 for BP position distributions
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
  • Use consistent U12-type/U2-type nomenclature throughout
    Replace bare "U12 introns", "U2 introns", "U12 motifs" with "U12-type introns", "U2-type introns", "U12-type motifs" in user-facing prose.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
  • Fix CoLa-seq citation: Luo et al. 2023 → Zeng et al. 2022
    The CoLa-seq branch point data is from Zeng et al. (2022) Mol Cell 82:4681-4699, not "Luo et al. 2023". Add full citation to References.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
  • Fix technical discrepancies found during wiki-vs-code audit
    - Overview: fix "Linear SVM" → RBF SVM, update BP distance range to 10-15 nt
    - Output-files: fix log base (log_10 → log_2), fix awk filter ("." → "NA"), fix frac_pos column index ($11 → $12), update attributes format to verbose strings, remove outdated score_info example line, remove -s flag reference
    - Technical-algorithm: add StandardScaler note, harmonize memory estimates, fix BP search region length description
    - Usage-info: remove defunct -s flag, note CLI vs config.yaml default differences for scoring coords
    - Quick-start: harmonize memory estimate with Technical-algorithm page
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
  • Technical details: fix PWM fallback description to cover both U12 and U2
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
  • Technical details: clarify PWM selection and scoring algorithm
    Document per-intron dinucleotide-based PWM selection, U2 fallback masking, and the two-step BPS scoring process (position selection with U12 PWM, then log-ratio at the same position with both PWMs).
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
  • Technical details: add U2 AT-AC PWMs to matrix listing
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
  • Update wiki for v2.2.0: 8D RBF SVM default model
    - Home: update version reference
    - Technical details: document 8D feature set with linear coefficient reference, RBF kernel, isotonic calibration, non-overlapping scoring regions, BPS scan confidence metric, updated training data and evaluation results
    - Training data and PWMs: document expanded reference sets (472 U12 + 30,155 U2), CoLa-seq BPS PWMs with reference_offset, U2 AT-AC PWMs
    - Output files: update score_info.iic column listing (30 columns), add bp_offset and attributes columns to meta.iic
    - Overview: update classification mode description
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026