REDICAT · RNA Editing Cellular Assessment Toolkit

REDICAT is a high-throughput Rust toolkit for exploring RNA editing events in both bulk and single-cell sequencing experiments. The command-line interface ships three production pipelines—bulk, bam2mtx, and call—all built on a shared library that emphasizes lock-free parallelism, sparse data structures, and deterministic output.

Citation

If REDICAT supports your research, please cite:

Wei, T., Li, J., Lei, X., Lin, R., Wu, Q., Zhang, Z., Shuai, S., & Tian, R. (2025). Multimodal CRISPR screens uncover DDX39B as a global repressor of A-to-I RNA editing. Cell Reports, 44(7). https://doi.org/10.1016/j.celrep.2025.116009

Highlights

  • Canonical-contig smart defaults: By default every command operates on the primary human chromosomes (chr1–chr22, chrX, chrY, chrM). Add --allcontigs to opt into decoys, patches, or spike-ins when you need full genome coverage.
  • On-the-fly site vetting: Depth, N-content, and editing-support thresholds mirror the bulk heuristics so only confident loci advance to matrix assembly.
  • Fearless concurrency: Rayon thread pools, per-thread chunk reducers, and reader pooling keep pileup traversal and sparse matrix assembly saturated on modern CPUs.
  • Dynamic Rayon work queue: ParGranges now flattens genomic tiles into a global work queue so bulk keeps every core busy even when only a handful of contigs remain in flight.
  • Parallel AnnData assembly: bam2mtx now collects per-thread chunk results, merges them with a parallel sparse assembler, and writes .h5ad outputs without a single-threaded bottleneck or extra global channels.
  • Run-aware sparse merging (New): The AnnData converter now persists sorted triplet runs to disk and performs a streaming multi-way merge when finalising CSR layers, keeping peak RSS close to the sparse output size and preventing the single-threaded 30+ minute stall that previously occurred after "Processed 100%" log events.
  • Depth-aware chunking: bam2mtx streams TSV manifests with Polars, accumulates position weights (DEPTH + REF_SKIP + FAIL) until the --chunksize budget is spent, and automatically falls back to the hotspot-friendly --chunk-size-max-depth batches (default 2) whenever NEAR_MAX_DEPTH=true.
  • Local-time observability: Console logs now emit system time zone timestamps and guaranteed 30-minute status beacons that report remaining chunks alongside outstanding NEAR_MAX_DEPTH loci for long-running jobs.
  • Depth-cap skip audit: Any site whose pileup depth meets or exceeds --max-depth is skipped before UMI aggregation, recorded to a sibling <name>_skiped_sites.txt manifest, and summarised in the final log output for easy triage.
  • Layered architecture: The library is split into core (shared utilities & sparse ops), engine (parallel schedulers and position primitives), and pipeline (bam2mtx & call workflows), keeping reusable pieces lightweight and testable.
  • Sparse-first analytics: All matrix work is written against CSR matrices with adaptive density hints so memory usage tracks the number of edited positions, not the theoretical genome size.
  • Strand-collapsed editing counts (New): The call pipeline now merges complementary *0/*1 strands before assigning counts to the ref, alt, and others matrices, guaranteeing strand-invariant allele totals.
  • Optimized sparse operations: Uses two-pointer algorithms for sparse matrix operations (O(nnz) complexity) instead of HashMap-based approaches, with SmallVec for temporary allocations and FxHashMap for faster non-cryptographic hashing (see the sketch after this list).
  • Minimal memory overhead: Arc clones are kept to minimum, performance-critical paths avoid unnecessary allocations, and streaming converters maintain bounded memory usage.
  • AnnData-native exports: The toolkit produces .h5ad files that slot directly into Python-based workflows (Scanpy, scVI, Seurat via conversion).
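
To make the two-pointer idea above concrete, here is a minimal sketch (not REDICAT's actual code) of adding two sparse vectors stored as sorted (index, value) pairs; because both inputs are sorted, the merge touches each nonzero exactly once and needs no hashing:

/// Add two sparse vectors stored as sorted (index, value) pairs.
/// Runs in O(nnz_a + nnz_b); no HashMap lookups are needed because both inputs are sorted.
fn sparse_add(a: &[(u32, u32)], b: &[(u32, u32)]) -> Vec<(u32, u32)> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].0.cmp(&b[j].0) {
            std::cmp::Ordering::Less => { out.push(a[i]); i += 1; }
            std::cmp::Ordering::Greater => { out.push(b[j]); j += 1; }
            std::cmp::Ordering::Equal => { out.push((a[i].0, a[i].1 + b[j].1)); i += 1; j += 1; }
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

The same pattern extends to row-wise CSR addition, which is why memory and runtime track the number of nonzeros rather than the theoretical matrix dimensions.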

Repository Layout

src/
  main.rs                 # CLI entry point and subcommand wiring
  commands/
    bulk/                 # Bulk depth CLI (args, processor, orchestration)
    bam2mtx/              # Matrix converter CLI (args, workflow, engine)
    call/                 # RNA editing pipeline CLI (args + pipeline)
    common.rs             # Shared CLI/runtime helpers (threads, contigs)
  lib/
    core/                 # Concurrency, IO, errors, read filters, sparse utilities
    engine/               # Parallel genomic range scheduler + position structs
    pipeline/
      bam2mtx/            # BAM ➜ sparse matrix processing pipeline
      call/               # RNA editing analytics pipeline

Installation

# Clone and build
git clone https://github.com/aStudyingTurtle/redicat.git
cd redicat
cargo build --release

# Optional: install into Cargo's bin directory
cargo install --path .

Binary artifacts live at target/release/redicat after compilation.

Quick Start

# Discover high-confidence editing sites on canonical chromosomes
redicat bulk sample.bam -o sites.tsv.gz --threads 16

# Generate single-cell mismatch matrices using a two-pass workflow
redicat bam2mtx --bam sample.bam --barcodes barcodes.tsv \
                --output counts.h5ad --two-pass --threads 16

# Tune coverage heuristics with `--min-depth`, `--max-n-fraction`, and `--editing-threshold` to mirror bulk filtering when necessary.

# Run the RNA editing analysis pipeline on the produced AnnData matrix
redicat call --input counts.h5ad --output results.h5ad \
             --fa GRCh38.fa --site-white-list sites.tsv.gz

Add --allcontigs to any command that should consider every contig in the BAM header instead of the canonical subset.

Subcommands

bulk

Calculates per-base depth and nucleotide counts across a BAM/CRAM. Filtering options allow you to tune MAPQ, base quality, minimum depth, and an editing-specific heuristic. The revamped scheduler flattens genomic tiles into a unified Rayon work queue and forces small stealable tasks via with_min_len(1) plus an adaptive with_max_len, keeping CPU utilisation high for long-running bulk traversals. Results stream to a bgzip-compressed TSV; the near-max-depth flag keys off the larger of the filtered depth or the full pileup support (DEPTH + REF_SKIP + FAIL), so the 1% ceiling tracks whichever coverage metric is higher. Positions whose filtered depth reaches the --max-depth ceiling are explicitly marked as NEAR_MAX_DEPTH, providing an audit trail for loci clipped by the depth guard. Parts of the bulk implementation are adapted from the perbase project.
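
The flattened work queue can be pictured with this simplified sketch (a hypothetical Tile type stands in for the real ParGranges regions; this is illustrative, not the scheduler's actual code):

use rayon::prelude::*;

/// Hypothetical stand-in for one genomic tile produced by the range scheduler.
struct Tile { contig: String, start: u64, end: u64 }

fn process_tiles(tiles: Vec<Tile>, threads: usize) {
    // All contigs' tiles sit in one flat Vec, so idle threads can steal work
    // even when only a handful of contigs still have tiles in flight.
    let max_len = (tiles.len() / (threads * 4)).max(1); // adaptive upper bound on task size
    tiles
        .par_iter()
        .with_min_len(1)       // keep individual tiles stealable
        .with_max_len(max_len) // but avoid excessive task-splitting overhead
        .for_each(|tile| {
            // the pileup traversal over [tile.start, tile.end) on tile.contig would run here
            let _ = (&tile.contig, tile.start, tile.end);
        });
}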

Option reference:

| Short | Long | Description | Default |
| --- | --- | --- | --- |
|  | <reads> | Input indexed BAM/CRAM to analyze. | required |
| -o | --output | Output path (bgzip added when missing). | required |
| -t | --threads | Worker threads for pileup traversal. | 10 |
| -c | --chunksize | Ideal basepairs per worker (ParGranges chunk). | 500000 |
| -Q | --min-baseq | Base-quality floor; lower bases counted as N. | 30 (implicit) |
| -q | --mapquality | Minimum mapping quality accepted. | 255 |
| -z | --zero-base | Emit 0-based coordinates instead of 1-based. | false |
| -D | --max-depth | Depth ceiling; positions within 1% of the cap are flagged as NEAR_MAX_DEPTH. | 8000 |
| -d | --min-depth | Minimum coverage required to report a site. | 10 |
| -n | --max-n-fraction | Maximum tolerated N fraction (depth / value). | 20 |
| -a | --all | Report all sites rather than editing-enriched subset. | false |
| -et | --editing-threshold | Depth divisor when checking for editing support. | 1000 |
| -A | --allcontigs | Traverse every contig instead of canonical set. | false |

bam2mtx

Converts barcoded BAMs into sparse AnnData matrices (per base, per barcode, strand-aware). Accepts precomputed site lists or can bootstrap one via --two-pass, which internally runs bulk with the same filters. Parallel chunk workers now reduce their results locally, and a multi-threaded sparse assembler materialises the final CSR layers before the .h5ad file is written, avoiding the previous channel-based bottleneck and keeping memory predictable.
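
The chunk-local reduction followed by a parallel merge can be sketched as follows (hypothetical triplet representation, not the crate's actual API); each worker appends only to its own batch, and the batches are combined with a Rayon reduce rather than being funnelled through a channel to a single consumer:

use rayon::prelude::*;

type Triplet = (u32, u32, u32); // (row = cell index, col = site index, value = count)

fn assemble(per_chunk_batches: Vec<Vec<Triplet>>) -> Vec<Triplet> {
    // Merge per-thread batches pairwise in parallel instead of draining them serially.
    let mut merged: Vec<Triplet> = per_chunk_batches
        .into_par_iter()
        .reduce(Vec::new, |mut acc, mut batch| {
            acc.append(&mut batch);
            acc
        });
    // One parallel sort so CSR construction can walk rows and columns in order.
    merged.par_sort_unstable_by_key(|&(row, col, _)| (row, col));
    merged
}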

During long conversions the console prints system-local timestamps and, in addition to 5% progress beacons, guarantees a status snapshot every 30 minutes summarising remaining chunks and unprocessed NEAR_MAX_DEPTH loci so operators can watch backlog decay in real time. At completion the command emits a <output>_skiped_sites.txt sibling file listing the contig, 1-based position, and observed depth of every locus that was skipped because its pileup depth met or exceeded --max-depth.

Recent Memory Optimizations (v0.3.1):

  • UMI String Interning: Identical UMI sequences are now deduplicated in memory using Arc<str>, reducing memory usage by 50-70% for high-depth positions with redundant UMIs (see the sketch after this list).
  • Adaptive HashMap Sizing: Hash maps for counts and UMI consensus now use dynamic capacity estimation based on whether a chunk contains high-depth positions, preventing over-allocation.
  • Aggressive Triplet Spilling: The in-memory triplet threshold has been reduced from 500K to 100K (with a chunk-size multiplier of 2× instead of 32×), ensuring more frequent disk flushes and lower peak memory usage.
  • Enhanced Monitoring: Each chunk now logs detailed statistics including position count, high-depth site count, and total weight, making it easier to identify memory-intensive workloads.
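
A minimal sketch of the interning idea (illustrative only; the pipeline's real cache layout may differ): duplicate UMI sequences share a single heap allocation, so every additional occurrence costs only a pointer-sized Arc handle.

use std::collections::HashSet;
use std::sync::Arc;

/// Return a shared handle for `umi`, allocating only the first time a sequence is seen.
fn intern(cache: &mut HashSet<Arc<str>>, umi: &str) -> Arc<str> {
    if let Some(existing) = cache.get(umi) {
        return Arc::clone(existing); // reuse the existing allocation
    }
    let shared: Arc<str> = Arc::from(umi);
    cache.insert(Arc::clone(&shared));
    shared
}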

Highlights:

  • TSV positions are filtered to canonical contigs unless --allcontigs is specified.
  • Depth/N/editing thresholds drop low-quality loci before matrix assembly, matching the first-pass bulk defaults.
  • --two-pass scouts sites with bulk --max-depth 8000, trimming runaway pileups before the heavy second pass.
  • Reader pools and chunk-aware parallelism keep pileups cache-friendly while chunk staging bounds memory.
  • Output matrices respect the density hint supplied via --matrix-density, and the streaming converter remaps cell/position indices on the fly before materialising CSR layers (see the sketch after this list).
  • Info-level progress logs report chunk-level completion percentage, detailed chunk statistics, and ETA so long runs stay observable.
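
The on-the-fly index remapping mentioned above can be illustrated with a small sketch (hypothetical helper, not the converter's actual code): barcodes and sites receive dense row/column indices the first time they are observed, so triplets can be emitted directly in final matrix coordinates.

use std::collections::HashMap;

/// Map a raw key (a cell barcode or a "chrom:pos" site key) to a dense matrix index,
/// assigning the next free index the first time the key is seen.
fn remap(index_of: &mut HashMap<String, u32>, key: &str) -> u32 {
    let next = index_of.len() as u32;
    *index_of.entry(key.to_owned()).or_insert(next)
}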

Option reference:

| Short | Long | Description | Default |
| --- | --- | --- | --- |
| -b | --bam | Input indexed BAM/CRAM file. | required |
|  | --tsv | Optional site list TSV (columns CHR and POS). | optional |
|  | --barcodes | Cell barcode whitelist (one per line). | required |
|  | --two-pass | Run bulk first to build a fresh site list. | false |
| -o | --output | Output AnnData (.h5ad) path. | required |
| -t | --threads | Worker threads for pileup + AnnData writing. | 10 |
| -q | --min-mapq | Minimum mapping quality per read. | 255 |
| -Q | --min-baseq | Minimum base quality; lower bases count as N. | 30 |
| -d | --min-depth | Minimum non-N coverage to keep a site. | 10 |
| -n | --max-n-fraction | Maximum tolerated ambiguous fraction (depth / value). | 20 |
| -et | --editing-threshold | Ensures at least two bases exceed depth / editing-threshold. | 1000 |
| -D | --max-depth | Cap on pileup traversal depth during matrix assembly. Sites with depth ≥ cap are skipped, logged, and mirrored to <output>_skiped_sites.txt. | 65536 |
| -S | --stranded | Treat UMIs as strand-aware. | false |
|  | --umi-tag | BAM tag containing UMI sequence. | UB |
|  | --cb-tag | BAM tag containing cell barcode. | CB |
| -r | --reference | Reference FASTA (required for CRAM). | optional |
| -c | --chunksize | Genomic position weight budget per processing chunk (applied to standard loci). | 100000 |
|  | --chunk-size-max-depth | Maximum hotspot chunk length when NEAR_MAX_DEPTH is true in the TSV. | 2 |
|  | --matrix-density | Hint for CSR buffer sizing. | 0.005 |
| -A | --allcontigs | Include decoys and noncanonical contigs. | false |
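
The depth-weighted chunking governed by --chunksize and --chunk-size-max-depth can be sketched roughly as follows (hypothetical record type; the real manifest handling uses Polars): normal sites accumulate weight (DEPTH + REF_SKIP + FAIL) until the budget is spent, while NEAR_MAX_DEPTH sites are collected into much smaller hotspot batches.

/// Hypothetical manifest row; `weight` corresponds to DEPTH + REF_SKIP + FAIL.
struct Site { weight: u64, near_max_depth: bool }

fn chunk_sites(sites: &[Site], chunksize: u64, chunk_size_max_depth: usize) -> Vec<Vec<usize>> {
    let (mut chunks, mut normal, mut hotspot) = (Vec::new(), Vec::new(), Vec::new());
    let mut budget = 0u64;
    for (idx, site) in sites.iter().enumerate() {
        if site.near_max_depth {
            // Hotspots form small dedicated batches (default cap: 2 sites per chunk).
            hotspot.push(idx);
            if hotspot.len() >= chunk_size_max_depth {
                chunks.push(std::mem::take(&mut hotspot));
            }
        } else {
            normal.push(idx);
            budget += site.weight;
            if budget >= chunksize {
                chunks.push(std::mem::take(&mut normal));
                budget = 0;
            }
        }
    }
    for leftover in [normal, hotspot] {
        if !leftover.is_empty() {
            chunks.push(leftover);
        }
    }
    chunks
}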

call

Executes the full RNA editing pipeline on .h5ad inputs: variant annotation, strand-aware ref/alt counting, CEI computation, and mismatch statistics. Validation steps ensure reference FASTA, white-list TSV, and input AnnData all exist and are readable before work begins.

Recent updates collapse opposing strand layers (A0 + A1, etc.) before counts are distributed to the ref, alt, and others matrices. This guarantees that allelic support is identical regardless of BAM strand orientation and avoids spuriously split coverage in downstream metrics.

Editing Logic (Updated):

  • redicat call --editingtype now enforces complementary substitutions for all six editing signatures. For example, --editingtype AG keeps A→G events alongside the strand-complementary T→C observations, while discarding loci whose reference base is neither A nor T. Equivalent complementary rules exist for AC, AT, CA, CG, and CT.
  • Optimized ref/alt/obs matrix calculation (v0.3.2+): The matrices use an efficient mask-based sparse operation approach inspired by the Python implementation, achieving O(nnz) complexity. For AG editing:
    • When ref_base=A: ref layer = A counts, alt layer = G counts, obs layer = T+C counts
    • When ref_base=T: ref layer = T counts, alt layer = C counts, obs layer = A+G counts
    • Sites with ref_base=G or C are excluded
  • One-hot encoding masks are created for ref/alt/others positions, then applied via element-wise sparse multiplication to each base matrix, maintaining full sparsity and dramatically improving performance for large datasets (see the sketch after this list).
  • Similar logic applies to all other editing types (AC, AT, CA, CG, CT), ensuring strand-invariant and biologically accurate allele counting with optimal efficiency.
  • The site white-list contributes an is_editing_site flag inside var, while the mismatch thresholds (--max-other-threshold, --min-coverage, --min-edited-threshold, --min-ref-threshold) produce a filter_pass flag. Only loci marked with both flags feed the observation-level ref, alt, and others totals as well as CEI.
  • The generated AnnData therefore includes a var table with is_editing_site, filter_pass, and Mismatch columns, and an obs table whose ref/alt/others columns and CEI metric reflect exclusively the confidently filtered editing sites.
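
A minimal sketch of the masking logic for --editingtype AG (triplet representation and hypothetical helper names; the pipeline itself operates on CSR layers): counts from each observed-base layer are routed into ref/alt/others according to the site's reference base, touching only the nonzero entries.

/// Which output layer a count contributes to under A-to-G (and complementary T-to-C) editing.
enum Target { Ref, Alt, Others, Drop }

/// Per-site, per-base one-hot decision for --editingtype AG.
/// Sites whose reference base is G or C are excluded entirely.
fn ag_mask(ref_base: u8, observed: u8) -> Target {
    match (ref_base, observed) {
        (b'A', b'A') | (b'T', b'T') => Target::Ref,
        (b'A', b'G') | (b'T', b'C') => Target::Alt,
        (b'A', _) | (b'T', _) => Target::Others,
        _ => Target::Drop,
    }
}

/// Route the nonzeros of one observed-base layer, given as (cell, site, count) triplets,
/// into ref/alt/others triplet lists in O(nnz).
fn split_layer(
    observed: u8,
    triplets: &[(u32, u32, u32)],
    ref_bases: &[u8],                    // reference base per site (column)
    out: &mut [Vec<(u32, u32, u32)>; 3], // [ref, alt, others]
) {
    for &(cell, site, count) in triplets {
        match ag_mask(ref_bases[site as usize], observed) {
            Target::Ref => out[0].push((cell, site, count)),
            Target::Alt => out[1].push((cell, site, count)),
            Target::Others => out[2].push((cell, site, count)),
            Target::Drop => {}
        }
    }
}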

Important options:

  • --editingtype · choose among AG, AC, AT, CA, CG, CT editing signatures.
  • --min-coverage · per-site coverage filter before CEI calculation.
  • --threads · configure Rayon pool size (defaults to 2 for stability).

Performance Notes

  • Sparse matrix column sums for site statistics (New v0.3.2): Site-level mismatch statistics (XX_ref, XX_alt, XX_others) are now computed using efficient sparse matrix column sum operations directly on the ref/alt/others layers, achieving O(nnz) complexity and eliminating the need for DataFrame row iteration.
  • Parallel tree reduction for sparse matrix sums (New v0.3.2): Coverage matrix calculation now uses a divide-and-conquer parallel tree reduction strategy, achieving O(log n) parallel depth instead of O(n) sequential operations. For 8 base layers, this reduces from 7 sequential additions to 3 parallel levels with 4-way parallelism at the first level, dramatically improving performance while maintaining full sparsity (a sketch follows this list).
  • Mask-based sparse matrix operations (New v0.3.2): The call pipeline now uses one-hot encoding masks combined with element-wise sparse multiplication for ref/alt/obs matrix calculation, achieving O(nnz) complexity instead of O(n_obs × n_vars). This dramatically improves performance for large sparse datasets while maintaining full correctness and sparsity.
  • UMI String Interning (New): Identical UMI sequences are deduplicated in memory using Arc<str>, reducing memory footprint by 50-70% for datasets with repetitive UMIs at high-depth positions.
  • Adaptive HashMap Capacity (New): Hash maps dynamically adjust their initial capacity based on chunk characteristics (high-depth vs normal), preventing over-allocation while maintaining performance.
  • Parallel sparse reduction (New): bam2mtx merges per-thread chunk outputs with a lock-free sparse assembler, so CSR materialisation scales with core count without staging everything in a single consumer thread.
  • Aggressive Triplet Spilling (New): The streaming matrix builder now uses a lower spill threshold (100K triplets, minimum 50K cap) to flush data to disk more frequently, dramatically reducing peak memory usage during large conversions.
  • Depth-cap awareness (New): bulk now marks any site whose filtered depth reaches the configured maximum as NEAR_MAX_DEPTH, complementing the existing 1% proximity heuristic and making depth clipping visible downstream.
  • Streaming CSR materialisation (New): Triplet runs are merged with a heap-based iterator that deduplicates row/column pairs on the fly, so final layer construction stays parallel-friendly and never loads all triplets into RAM at once.
  • Enhanced Chunk Monitoring (New): Each chunk logs position count, high-depth site count, and total weight, providing visibility into workload characteristics and helping diagnose memory-intensive regions.
  • Reader pooling: Each worker thread checks out an IndexedReader from a pool, avoiding reopen/seek penalties when iterating across genomic tiles.
  • Triplet batching: Sparse matrices are assembled from thread-local batches of (row, col, value) triplets, eliminating cross-thread contention, and the final merge operates fully in parallel without an intermediate channel stage.
  • Adaptive chunking: Dual CLI knobs (--chunksize for the depth-weighted standard loci budget and --chunk-size-max-depth for hotspot batch caps, now defaulting to 2) govern the parallel granularity for bam2mtx (and the batch size for AnnData writes), while the refactored ParGranges scheduler flattens tiles into a single Rayon queue so bulk honors --chunksize without serial bottlenecks. Normal chunk weights now match the manifest’s DEPTH + REF_SKIP + FAIL metric for tighter alignment with bulk heuristics.
  • Chunk-level fetch & depth guards: bam2mtx batches contiguous sites per contig, records observed depth, and—in --two-pass mode—relies on a first-pass bulk run capped at --max-depth 8000 plus in-run pileup guards to prune oversaturated loci before dense pileups reach the matrix stage.
  • Progress-aware logging: Chunk processors emit periodic % complete + ETA updates plus detailed per-chunk statistics so multi-hour conversions remain easy to follow from the console.
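
A generic sketch of that tree reduction (assuming a layer type with an Add implementation; these are not the crate's actual signatures): the slice of base layers is halved recursively and the two halves are summed in parallel with rayon::join.

use rayon::join;
use std::ops::Add;

/// Sum a slice of sparse layers by recursive halving: O(log n) parallel depth
/// instead of n - 1 sequential additions. Returns None for an empty slice.
fn tree_sum<M>(layers: &[M]) -> Option<M>
where
    M: Add<Output = M> + Clone + Send + Sync,
{
    match layers {
        [] => None,
        [only] => Some(only.clone()),
        _ => {
            let mid = layers.len() / 2;
            let (left, right) = join(|| tree_sum(&layers[..mid]), || tree_sum(&layers[mid..]));
            Some(left? + right?)
        }
    }
}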

Testing & Data

# REDIportal3 white-list, for use with `call`
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/rediportal3.tsv.gz

# 10X 3' 20K_PBMC data
# chr22
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/chr22.bam
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/chr22.bam.bai
# cell barcodes
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/barcodes.tsv.gz

A minimal BAM (test/chr22.bam) and index are included for smoke tests. Run cargo test to execute unit tests that exercise pileup traversal and CLI argument parsing.
