Example usage
This page provides practical examples for common intronIC use cases. For full argument documentation, see the Usage info page.
The easiest way to verify your installation:
# Run bundled test (Human Chr19, ~1 min with -p 4)
intronIC test -p 4
# Show test data location on your system
intronIC test --show-only
If you prefer to run classification manually with test data:
- Test data is bundled with the package; use intronIC test --show-only to find its location
- Alternatively, download the chromosome 19 test files
The default pretrained model is loaded automatically:
intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz \
-a Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
-n homo_sapiens
This works for virtually all species. You can optionally specify a custom model:
intronIC -g genome.fa -a annotation.gff -n species --model custom.model.pkl
To train a model on reference sequences:
intronIC train -n homo_sapiens
This creates a .model.pkl file that can be used for classification. Depending on the selected options, model training can take many hours. The default model should serve most users well in most cases.
To extract introns without classification:
intronIC extract -g genome.fa -a annotation.gff -n species
Information about the run will be printed to the screen; this same information (plus some additional details) can be found in the log.iic file. The excerpt below is illustrative: exact line counts, status lines, and model-loading messages may differ across versions, and v2.7 adds mode-separation and continuous-discount log lines not shown here.
================================================================================
intronIC v2.7.0
Started: 2025-12-08 12:44:39
================================================================================
Command and Configuration:
Command: /home/glarue/code/intronIC/.pixi/envs/default/bin/intronIC -g GCF_000001405.40_GRCh38.p14_genomic.fna.gz -a
GCF_000001405.40_GRCh38.p14_genomic.gff.gz -n homo_sapiens.cds -p 8 -f cds
Working directory: /home/glarue/code/intronIC/run_tests/hsapiens
Run name: homo_sapiens.cds
Input mode: annotation
Classification threshold: 90.0%
Output directory: /home/glarue/code/intronIC/run_tests/hsapiens
Genome: /home/glarue/code/intronIC/run_tests/hsapiens/GCF_000001405.40_GRCh38.p14_genomic.fna.gz
Annotation: /home/glarue/code/intronIC/run_tests/hsapiens/GCF_000001405.40_GRCh38.p14_genomic.gff.gz
Model: /home/glarue/code/intronIC/src/intronIC/data/default_pretrained.model.pkl
ℹ Streaming mode: processing per-contig
ℹ Loading pretrained model from /home/glarue/code/intronIC/src/intronIC/data/default_pretrained.model.pkl
Loaded two-pass mode-separation bundle (first-pass v4_aug_cluster_aware + second-pass v5_modesep_aug; 126 models per ensemble, 3 seeds × 42 sub-models)
Adaptive normalizer fit: scoring introns through PWMs to fit RobustScaler
ℹ Loading PWM matrices
ℹ Indexing annotation: GCF_000001405.40_GRCh38.p14_genomic.gff.gz
Indexed 4,932,571 annotations across 705 contigs
ℹ Using indexed genome access: GCF_000001405.40_GRCh38.p14_genomic.fna.gz
ℹ Processing 705 contigs in parallel (8 processes)
Merging output: 202,594 (11.89%) scored + 45,650 (2.68%) omitted = 248,244 (14.56%) total introns for output files
ℹ Streaming classification complete: 202,594 introns classified
Total genes: 55,619, introns generated: 1,704,427
Intron Filtering Summary:
┌────────────────────────────┬────────────┬────────────┐
│ Category │ Included │ Excluded │
├────────────────────────────┼────────────┼────────────┤
│ Duplicates │ 0 │ 1,457,363 │
│ Too short │ 0 │ 240 │
│ Ambiguous bases │ 0 │ 4 │
│ Non-canonical │ 525 │ 0 │
│ Overlapping │ 0 │ 0 │
│ Alternative isoform │ 0 │ 45,211 │
├────────────────────────────┼────────────┼────────────┤
│ Total excluded │ │ 1,502,818 │
│ Retained for scoring │ │ 201,414 │
└────────────────────────────┴────────────┴────────────┘
Classification Results (threshold: 90.0%):
┌──────────────────────┬───────────┬────────────┐
│ Type │ Count │ Percentage │
├──────────────────────┼───────────┼────────────┤
│ U12-type (total) │ 702 │ 0.35% │
│ U12-type (AT-AC) │ 185 │ 0.09% │
│ U2-type │ 201,892 │ 99.65% │
├──────────────────────┼───────────┼────────────┤
│ Total │ 202,594 │ 100.00% │
└──────────────────────┴───────────┴────────────┘
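As a quick sanity check on an excerpt like the one above, the reported percentages and totals can be recomputed from the raw counts. The sketch below uses only numbers copied from the two tables; it assumes the AT-AC row is a subset of the U12-type total (which is consistent with 702 + 201,892 matching the table total) and is not part of intronIC itself.

```python
# Counts copied from the "Classification Results" table in the excerpt.
counts = {
    "U12-type (total)": 702,
    "U12-type (AT-AC)": 185,   # assumed to be a subset of the U12-type total
    "U2-type": 201_892,
}

# The table total is U12-type (total) + U2-type.
total = counts["U12-type (total)"] + counts["U2-type"]

def pct(n, denominator):
    """Percentage rounded to two decimals, as in the log tables."""
    return round(100 * n / denominator, 2)

percentages = {name: pct(n, total) for name, n in counts.items()}

# Excluded counts copied from the "Intron Filtering Summary" table.
excluded = {
    "Duplicates": 1_457_363,
    "Too short": 240,
    "Ambiguous bases": 4,
    "Alternative isoform": 45_211,
}
total_excluded = sum(excluded.values())

for name, value in percentages.items():
    print(f"{name}: {value:.2f}%")
print(f"Total classified: {total:,}")
print(f"Total excluded: {total_excluded:,}")
```

The recomputed values agree with the "Total" and "Total excluded" rows shown in the excerpt.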
If only the intron sequences are desired, use the extract subcommand which skips classification and produces only a subset of the output files:
intronIC extract -g genome.fa -a annotation.gff -n species
For complex or reproducible runs, use a YAML configuration file:
# Generate a template configuration file
intronIC --generate-config > my_config.yaml
# Edit my_config.yaml, then run:
intronIC --config my_config.yaml -g genome.fa -a annotation.gff -n species
Example configuration:
scoring:
threshold: 90.0
exclude_noncanonical: false
score_adjustment:
enabled: true
extraction:
flank_length: 100
feature_type: both
training:
ensemble:
n_models: 42
eval_mode: nested_cv
performance:
processes: 8
For most species, the default settings work well. The v3 model bundle ships with a multispecies fallback scaler that is used automatically for very small inputs (fewer than 200 scoreable introns), so single-intron and tiny-annotation runs work out of the box.
Two cases where you might want to override the default:
# First run: fit and save adaptive normalizer on full genome
intronIC -g genome.fa -a annotation.gff -n species \
--normalizer-mode adaptive --save-normalizer
# Subsequent runs: reuse the normalizer
intronIC -g subset.fa -a subset.gff -n species \
--load-normalizer species.normalizer.pkl
If you want to suppress per-species z-score shifts entirely (e.g., for U12-absent or outlier genomes, where adaptive mode can compress the U2 distribution):
intronIC -g genome.fa -a annotation.gff -n species \
--normalizer-mode human
In a v3 bundle, --normalizer-mode human resolves to the bundled multispecies fallback scaler.
Note: Both are advanced features. For standard analyses on normal-sized genomes, default settings are correct.
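For scripted pipelines, the fit-then-reuse normalizer workflow can be expressed as two argument lists. This is a minimal sketch that only assembles the commands shown above. The helper name normalizer_commands is hypothetical, and the saved file name is assumed to follow the species.normalizer.pkl pattern from the example (run name plus .normalizer.pkl); check the actual output of your first run before relying on it.

```python
import shlex

def normalizer_commands(genome, annotation, subset_genome, subset_annotation, name):
    """Build the two documented commands: fit/save, then reuse (hypothetical helper)."""
    fit = [
        "intronIC", "-g", genome, "-a", annotation, "-n", name,
        "--normalizer-mode", "adaptive", "--save-normalizer",
    ]
    # Assumption: the saved normalizer is named <run name>.normalizer.pkl,
    # mirroring "species.normalizer.pkl" in the example above.
    reuse = [
        "intronIC", "-g", subset_genome, "-a", subset_annotation, "-n", name,
        "--load-normalizer", f"{name}.normalizer.pkl",
    ]
    return fit, reuse

fit_cmd, reuse_cmd = normalizer_commands(
    "genome.fa", "annotation.gff", "subset.fa", "subset.gff", "species"
)
print(shlex.join(fit_cmd))
print(shlex.join(reuse_cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) so that the second step only runs if the first one succeeds.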
Speed up analysis with parallel processes (streaming mode is default and scales efficiently):
intronIC -g genome.fa -a annotation.gff -n species -p 8
The -p flag parallelizes the entire extraction and scoring pipeline. With streaming mode (default), -p values of 5 to 10 typically provide a 2-3× speedup with moderate memory usage.
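When launching runs from a script, the process count can be derived from the machine rather than hard-coded. The sketch below is one possible approach, not an intronIC feature: the helper names are hypothetical, and the cap of 10 is an assumption taken from the 5 to 10 range quoted above.

```python
import os
import shlex

def suggest_processes(cpu_count=None, cap=10):
    """Return a value for -p: the available cores, limited to `cap` (assumed ceiling)."""
    available = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(available, cap))

def build_command(genome, annotation, name, processes):
    """Assemble the documented command line as an argument list."""
    return [
        "intronIC", "-g", genome, "-a", annotation,
        "-n", name, "-p", str(processes),
    ]

cmd = build_command("genome.fa", "annotation.gff", "species", suggest_processes())
print(shlex.join(cmd))
```

On shared machines it may be preferable to pass an explicit cpu_count (for example, the number of cores granted by the scheduler) instead of relying on os.cpu_count().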
--streaming and --in-memory produce bit-identical classifications since v2.4 (covered by tests/integration/test_streaming_equivalence.py); the choice is purely a runtime/memory tradeoff. Reference run on full human GRCh38.p13 + NCBI RefSeq GFF, 257k scored introns, -p 5, default v2.7 bundle:
| Mode | Wall time | Peak RSS |
|---|---|---|
| --streaming (default) | ~40 min | ~5.3 GB |
| --in-memory | expected similar (not re-measured for v2.7) | expected roughly 2× streaming (not re-measured for v2.7) |
# Streaming mode is automatic — no flag needed
intronIC -g GRCh38.fa.gz -a gencode.gff3.gz -n homo_sapiens -p 8
Streaming writes intron sequences to a temporary on-disk SQLite database during extraction, keeps only scoring motifs in memory, and parallelizes each phase (extraction, BG correction, adaptive-normalizer fit, first-pass classification, mode-separation second pass) per-contig.
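The general pattern described here (full sequences on disk, short motifs in memory) can be illustrated with Python's standard sqlite3 module. This is a simplified sketch of the idea only, not intronIC's actual implementation: the table layout, the function names, and the choice of terminal dinucleotides as the "motif" are all assumptions made for illustration.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

def store_introns(db_path, introns):
    """Write full sequences to an on-disk SQLite table; return only short motifs."""
    motifs = {}
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits the transaction on success
            conn.execute(
                "CREATE TABLE IF NOT EXISTS introns (name TEXT PRIMARY KEY, seq TEXT)"
            )
            for name, seq in introns:
                conn.execute("INSERT INTO introns VALUES (?, ?)", (name, seq))
                # Keep only the terminal dinucleotides in memory (illustrative motif).
                motifs[name] = (seq[:2], seq[-2:])
    return motifs

def fetch_sequence(db_path, name):
    """Read one full sequence back from disk when it is needed."""
    with closing(sqlite3.connect(db_path)) as conn:
        row = conn.execute(
            "SELECT seq FROM introns WHERE name = ?", (name,)
        ).fetchone()
    return row[0] if row else None

with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "introns.sqlite")
    toy_introns = [
        ("intron_1", "GTAAGT" + "A" * 50 + "TTTCAG"),
        ("intron_2", "ATATCC" + "T" * 50 + "CCTTAC"),
    ]
    motifs = store_introns(db, toy_introns)
    seq_1 = fetch_sequence(db, "intron_1")
    missing = fetch_sequence(db, "intron_3")

print(motifs)
print(len(seq_1))
```

The memory footprint then scales with the motif dictionary rather than with total sequence length, which is the tradeoff the table above reflects.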
intronIC -g genome.fa -a annotation.gff -n species --in-memory
In-memory mode loads all intron sequences into memory at extraction time. It is also the path used internally by the --sequences and --bed input modes (those bypass the per-contig streaming pipeline). On small single-contig inputs, the per-contig overhead of streaming mode means in-memory is somewhat faster; on multi-contig genomes the two are roughly tied at typical parallelism levels.
Many additional options exist for a variety of use cases. Run intronIC --help for additional details and/or see the Full usage info page.