Example usage

This page provides practical examples for common intronIC use cases. For full argument documentation, see the Usage info page.

Quick test

The easiest way to verify your installation:

# Run bundled test (Human Chr19, ~1 min with -p 4)
intronIC test -p 4

# Show test data location on your system
intronIC test --show-only

Test data for manual runs

If you prefer to run classification manually with test data:

- Test data is bundled with the package; use intronIC test --show-only to find its location.
- Alternatively, download the chromosome 19 test files:
  - FASTA
  - GFF3

Basic usage

Classification (recommended for most users)

The default pretrained model is loaded automatically:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz \
         -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
         -n homo_sapiens

The default model works for virtually all species. To use a custom model instead, pass it with --model:

intronIC -g genome.fa -a annotation.gff -n species --model custom.model.pkl

Training a new model

To train a model on reference sequences:

intronIC train -n homo_sapiens

This creates a .model.pkl file that can be used for classification. Depending on the selected options, model training can take many hours; the default model should serve most users well.

Information about each run is printed to the screen; the same information (plus some additional details) is written to the log.iic file. The excerpt below, from a classification run, is illustrative: exact line counts, status lines, and model-loading messages may differ across versions, and v2.7 adds mode-separation and continuous-discount log lines not shown here.

================================================================================
intronIC v2.7.0
Started: 2025-12-08 12:44:39
================================================================================

Command and Configuration:
  Command: /home/glarue/code/intronIC/.pixi/envs/default/bin/intronIC -g GCF_000001405.40_GRCh38.p14_genomic.fna.gz -a GCF_000001405.40_GRCh38.p14_genomic.gff.gz -n homo_sapiens.cds -p 8 -f cds
  Working directory: /home/glarue/code/intronIC/run_tests/hsapiens
  Run name: homo_sapiens.cds
  Input mode: annotation
  Classification threshold: 90.0%
  Output directory: /home/glarue/code/intronIC/run_tests/hsapiens
  Genome: /home/glarue/code/intronIC/run_tests/hsapiens/GCF_000001405.40_GRCh38.p14_genomic.fna.gz
  Annotation: /home/glarue/code/intronIC/run_tests/hsapiens/GCF_000001405.40_GRCh38.p14_genomic.gff.gz
  Model: /home/glarue/code/intronIC/src/intronIC/data/default_pretrained.model.pkl

ℹ Streaming mode: processing per-contig
ℹ Loading pretrained model from /home/glarue/code/intronIC/src/intronIC/data/default_pretrained.model.pkl
Loaded two-pass mode-separation bundle (first-pass v4_aug_cluster_aware + second-pass v5_modesep_aug; 126 models per ensemble, 3 seeds × 42 sub-models)
Adaptive normalizer fit: scoring introns through PWMs to fit RobustScaler
ℹ Loading PWM matrices
ℹ Indexing annotation: GCF_000001405.40_GRCh38.p14_genomic.gff.gz
Indexed 4,932,571 annotations across 705 contigs
ℹ Using indexed genome access: GCF_000001405.40_GRCh38.p14_genomic.fna.gz
ℹ Processing 705 contigs in parallel (8 processes)
Merging output: 202,594 (11.89%) scored + 45,650 (2.68%) omitted = 248,244 (14.56%) total introns for output files
ℹ Streaming classification complete: 202,594 introns classified
Total genes: 55,619, introns generated: 1,704,427

Intron Filtering Summary:
┌────────────────────────────┬────────────┬────────────┐
│ Category                   │ Included   │ Excluded   │
├────────────────────────────┼────────────┼────────────┤
│   Duplicates               │          0 │  1,457,363 │
│   Too short                │          0 │        240 │
│   Ambiguous bases          │          0 │          4 │
│   Non-canonical            │        525 │          0 │
│   Overlapping              │          0 │          0 │
│   Alternative isoform      │          0 │     45,211 │
├────────────────────────────┼────────────┼────────────┤
│ Total excluded             │            │  1,502,818 │
│ Retained for scoring       │            │    201,414 │
└────────────────────────────┴────────────┴────────────┘


Classification Results (threshold: 90.0%):
┌──────────────────────┬───────────┬────────────┐
│ Type                 │ Count     │ Percentage │
├──────────────────────┼───────────┼────────────┤
│ U12-type (total)     │       702 │      0.35% │
│ U12-type (AT-AC)     │       185 │      0.09% │
│ U2-type              │   201,892 │     99.65% │
├──────────────────────┼───────────┼────────────┤
│ Total                │   202,594 │    100.00% │
└──────────────────────┴───────────┴────────────┘
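
As a quick sanity check on an excerpt like the one above, the table percentages can be recomputed from the raw counts. The sketch below is plain shell arithmetic on numbers copied from the illustrative log; pct is a hypothetical helper name and is not part of intronIC.

```shell
# pct is a hypothetical helper (not part of intronIC):
# prints PART/TOTAL as a percentage with two decimals.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f%%\n", 100 * part / total }'
}

# Classification table: counts copied from the excerpt above.
pct 702 202594       # U12-type (total)  -> 0.35%
pct 185 202594       # U12-type (AT-AC)  -> 0.09%
pct 201892 202594    # U2-type           -> 99.65%

# Filtering table: the excluded categories sum to "Total excluded".
echo $((1457363 + 240 + 4 + 45211))    # -> 1502818
```

The same helper can be pointed at counts from your own log.iic file to confirm that the reported percentages are relative to the number of classified introns.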

Sequence extraction only

If only the intron sequences are desired, use the extract subcommand which skips classification and produces only a subset of the output files:

intronIC extract -g genome.fa -a annotation.gff -n species

Using configuration files

For complex or reproducible runs, use a YAML configuration file:

# Generate a template configuration file
intronIC --generate-config > my_config.yaml

# Edit my_config.yaml, then run:
intronIC --config my_config.yaml -g genome.fa -a annotation.gff -n species

Example configuration:

scoring:
  threshold: 90.0
  exclude_noncanonical: false

  score_adjustment:
    enabled: true

extraction:
  flank_length: 100
  feature_type: both

training:
  ensemble:
    n_models: 42
  eval_mode: nested_cv

performance:
  processes: 8

Advanced: Custom normalization (rarely needed)

For most species, the default settings work well. The v3 model bundle ships with a multispecies fallback scaler that is used automatically for very small inputs (fewer than 200 scoreable introns), so single-intron / tiny-annotation runs work out of the box.

Two cases where you might want to override the default:

Reproducible normalization across runs on genome subsets

# First run: fit and save adaptive normalizer on full genome
intronIC -g genome.fa -a annotation.gff -n species \
         --normalizer-mode adaptive --save-normalizer

# Subsequent runs: reuse the normalizer
intronIC -g subset.fa -a subset.gff -n species \
         --load-normalizer species.normalizer.pkl
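
When several subsets share one saved normalizer, the reuse step above can be wrapped in a loop. This is a minimal sketch, assuming each subset is stored as <prefix>.fa and <prefix>.gff (a hypothetical naming scheme); run and classify_subsets are illustrative helper names, not intronIC features. It relies only on the --load-normalizer flag shown above and, by default, prints the commands instead of executing them.

```shell
# run: print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# classify_subsets NORMALIZER PREFIX...
# Assumes (hypothetically) that each subset lives in PREFIX.fa / PREFIX.gff.
classify_subsets() {
    normalizer="$1"
    shift
    for prefix in "$@"; do
        run intronIC -g "${prefix}.fa" -a "${prefix}.gff" -n "${prefix}" \
            --load-normalizer "${normalizer}"
    done
}

# Dry run: show the commands that would be executed for two subsets.
classify_subsets species.normalizer.pkl subset_a subset_b
```

Setting DRY_RUN=0 before calling classify_subsets executes the commands for real; reviewing the dry-run output first is a cheap way to catch path mistakes.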

Force the bundled multispecies scaler

To suppress per-species z-score shifts entirely (e.g., for U12-absent or outlier genomes, where adaptive normalization can compress the U2 distribution), force the bundled scaler:

intronIC -g genome.fa -a annotation.gff -n species \
         --normalizer-mode human

In a v3 bundle, --normalizer-mode human resolves to the bundled multispecies fallback scaler.

Note: Both are advanced features. For standard analyses on normal-sized genomes, default settings are correct.

Parallel processing

Speed up analysis with parallel processes (streaming mode is default and scales efficiently):

intronIC -g genome.fa -a annotation.gff -n species -p 8

The -p flag parallelizes the entire extraction and scoring pipeline. In streaming mode (the default), -p values of 5 to 10 typically give a 2-3× speedup with moderate memory usage.
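
One way to choose a value for -p is to take the machine's core count and cap it at the upper end of the range mentioned above. The helper below is a sketch of that heuristic only; pick_processes and its default cap of 10 are assumptions for illustration, not behavior built into intronIC.

```shell
# pick_processes CORES [MAX]
# Hypothetical helper: clamp a core count to the range 1..MAX (default 10).
pick_processes() {
    cores="$1"
    max="${2:-10}"
    if [ "$cores" -gt "$max" ]; then
        echo "$max"
    elif [ "$cores" -lt 1 ]; then
        echo 1
    else
        echo "$cores"
    fi
}

pick_processes 16    # -> 10
pick_processes 4     # -> 4

# Possible usage (getconf reports online cores on Linux and macOS):
#   intronIC -g genome.fa -a annotation.gff -n species \
#            -p "$(pick_processes "$(getconf _NPROCESSORS_ONLN)")"
```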

Memory modes

--streaming and --in-memory produce bit-identical classifications since v2.4 (covered by tests/integration/test_streaming_equivalence.py); the choice is purely a runtime/memory tradeoff. Reference run on full human GRCh38.p13 + NCBI RefSeq GFF, 257k scored introns, -p 5, default v2.7 bundle:

| Mode | Wall time | Peak RSS |
|------|-----------|----------|
| `--streaming` (default) | ~40 min | ~5.3 GB |
| `--in-memory` | not re-measured for v2.7 (expected similar to streaming) | not re-measured for v2.7 (expected roughly 2× streaming) |

Streaming mode (default)

# Streaming mode is automatic — no flag needed
intronIC -g GRCh38.fa.gz -a gencode.gff3.gz -n homo_sapiens -p 8

Streaming writes intron sequences to a temporary on-disk SQLite database during extraction, keeps only scoring motifs in memory, and parallelizes each phase (extraction, BG correction, adaptive-normalizer fit, first-pass classification, mode-separation second pass) per-contig.

In-memory mode

intronIC -g genome.fa -a annotation.gff -n species --in-memory

In-memory mode loads all intron sequences into memory at extraction time. It is also the path used internally by the --sequences and --bed input modes, which bypass the per-contig streaming pipeline. On small single-contig inputs, the per-contig overhead of streaming makes in-memory mode somewhat faster; on multi-contig genomes, the two are roughly tied at typical parallelism levels.
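
Since the tradeoff above depends on how many contigs the genome has, it can help to count FASTA headers before choosing a mode. The following is a rough sketch: count_contigs and suggest_mode_flag are hypothetical helper names, and the single-contig cutoff is an assumption drawn from the guidance above, not a rule enforced by intronIC.

```shell
# count_contigs FASTA
# Counts header lines (">") in a plain or gzip-compressed FASTA file.
count_contigs() {
    case "$1" in
        *.gz) gzip -cd "$1" | grep -c '^>' ;;
        *)    grep -c '^>' "$1" ;;
    esac
}

# suggest_mode_flag FASTA
# Hypothetical heuristic: single-contig input -> --in-memory,
# anything larger -> --streaming (the default).
suggest_mode_flag() {
    if [ "$(count_contigs "$1")" -le 1 ]; then
        echo "--in-memory"
    else
        echo "--streaming"
    fi
}

# Possible usage:
#   intronIC -g genome.fa -a annotation.gff -n species "$(suggest_mode_flag genome.fa)"
```

Because both modes produce identical classifications, a wrong guess here costs only time or memory, never correctness.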

Many additional options exist for a variety of use cases. Run intronIC --help for additional details and/or see the Full usage info page.
