Effortlessly normalize gigantic (or tiny) NGS datasets β one pipeline, one matrix, no stress.
giganorm is a robust, SLURM-optimized pipeline for high-throughput, region-specific expression quantification (TPM, RPKM, or FPKM) that works equally well for hundreds, thousands, or just a few NGS samplesβno project is too big or too small. Supporting both automatic SRA/ENA data retrieval and direct processing of user-supplied FASTQ files, giganorm adapts to a variety of data sources while ensuring all FASTQs are preprocessed with fastp for high-speed adapter trimming and quality filtering, and quality-checked with FastQC. From data download or FASTQ input, through STAR-based alignment and bedtools quantification over custom BED-defined regions, giganorm automates every step and logs all actions for reproducibility. The pipelineβs smart SLURM integration enables efficient scaling and batching for massive datasets, while an optional local mode allows for smaller projects or rapid testingβmaking giganorm a fit for both large studies and one-off analyses. User-friendly flags such as --keep-intermediate, --input_fastq, and --local give you full control over input types, intermediate file retention, and execution mode, and features like auto-conversion of BED formats and clear error handling minimize setup friction. At the end of each run, giganorm merges all per-sample results into a single, ready-to-analyze matrixβso you never need extra scripts or manual steps. Whether youβre quantifying genes, exons, enhancers, APA events, or any user-defined features, giganorm ensures accurate, normalized results with built-in QC, scalable processing, and a mascot-powered stamp of approvalβso you can focus on the science, not the workflow.
- Parallelized download, alignment, and quantification (SLURM arrays)
- Automatic download (SRA with ENA fallback)
- Automatic BED formatting fixes (tabs/spaces)
- Skips samples and steps already completedβno recomputation
- Batch-limited job submission (no cluster overflow)
- All-in-one matrix output for any metric (TPM/RPKM/FPKM)
- Optionally keeps or cleans up all intermediate files
- Easy, robust, and fun (Norman the GigaDog says so!)
| Dependency | Description | Website/Install |
|---|---|---|
| Bash | Shell interpreter (v4+ recommended) | bash |
| SLURM | Workload manager for cluster arrays | slurm |
| fastp | Quality control for FastQ data | fastp [1] |
| fastqc | Quality control checks on raw sequence data | fastqc [2] |
| STAR | RNA-seq aligner | STAR [3] |
| samtools | BAM file manipulation | samtools [4] |
| bedtools | Genomic intervals utility | bedtools[5] |
| sra-tools | For prefetch, fasterq-dump |
sra-tools [6] |
| wget | Command-line downloader | wget |
| WebProxy | proxy server with web-based configuration | WebProxy[7] |
| awk, grep | Standard text-processing tools | Built-in to most Linux/Unix systems |
β‘οΈ Tip: Use Conda or your cluster modules to install everything.
git clone https://github.com/jssprrt/GIGANORM.git
cd GIGANORM
chmod +x giganormMake sure the INDEX file is already generated for your target species using STAR
IF Not, please use the following command to do so:
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir /path/to/STARindex --genomeFastaFiles /path/to/genome_fasta_file.fa --sjdbGTFfile /path/to/genome_annotation_gtf.gtfShow help:
./giganorm --help
############################################################################
# #
# βββββββ βββ βββββββ ββββββ ββββ βββ βββββββ βββββββ ββββ ββββ #
# ββββββββ βββββββββββ βββββββββββββ βββββββββββββββββββββββββ βββββ #
# βββ ββββββββββ ββββββββββββββββββ ββββββ ββββββββββββββββββββββ #
# βββ βββββββββ ββββββββββββββββββββββββ ββββββββββββββββββββββ #
# ββββββββββββββββββββββββ ββββββ ββββββββββββββββββ ββββββ βββ βββ #
# βββββββ βββ βββββββ βββ ββββββ βββββ βββββββ βββ ββββββ βββ #
# #
# GIGANORM: When your datasets are so big, #
# only a pipeline with 'giga' energy can #
# chew through them (and burp up a matrix). #
# #
# "Go big, get normalized." #
# #
############################################################################
Usage:
./giganorm <srr_list.txt> <coords.bed> <STAR_index_dir> <output_dir> <batch_size> <metric>
./giganorm [--keep-intermediate] <srr_list.txt> <coords.bed> <STAR_index_dir> <output_dir> <batch_size> <metric>
./giganorm [--keep-intermediate] [--input_fastq] <fastq_files.txt> <coords.bed> <STAR_index_dir> <output_dir> <batch_size> <metric>
./giganorm --local <srr_list.txt> <coords.bed> <STAR_index_dir> <output_dir> <metric>
./giganorm --local [--input_fastq] <fastq_files.txt> <coords.bed> <STAR_index_dir> <output_dir> <metric>
./giganorm --local [--keep-intermediate] [--input_fastq] <fastq_files.txt> <coords.bed> <STAR_index_dir> <output_dir> <metric>
Arguments:
srr_list.txt File with one SRR accession per line.
fastq_files.txt (with --input_fastq) List of FASTQ file paths: single-end (one per line) or paired-end ("fq1::fq2" per line).
coords.bed BED file of regions.
STAR_index_dir Path to STAR genome index directory.
output_dir Root directory for outputs.
batch_size Max concurrent tasks (for SLURM mode).
metric TPM, RPKM, or FPKM.
Options:
--input_fastq Use a FASTQ file list instead of SRR list.
--keep, --keep-intermediate Keep all intermediate files.
--local Run all samples serially (no SLURM; for both SRR and FASTQ).
-h, --help Show this help message and exit.Prepare:
example_fastq_list - one fastq file per line (path/to/sampleName.fastq.gz for SE, and path/to/sampleName_1.fastq.gz::path/to/sampleName_2.fastq.gz for PE)
example_srr_list.txt β one SRA/ENA accession per line (no headers)
example_coords.bed β regions to quantify (chr, start, end, region_name; tab or space delimited)Run:
./giganorm /path/to/example_srr_list.txt /path/to/example_coords.bed /path/to/STARindex /path/to/output_dir 3 TPMAll samples will be processed in parallel batches (3 in this toy example)
Per-sample metrics appear in /path/to/output_dir/TPM/SRRxxxxxx.TPM.txt
A single merged matrix will be generated: /path/to/output_dir/TPM/combined_TPM_matrix.tsv
All logs are in /path/to/output_dir/logs/
./giganorm --keep-intermediate /path/to/example_srr_list.txt /path/to/example_coords.bed /path/to/STARindex /path/to/output_dir 3 TPMThis will run in serial order:
Make sure when running locally do not define <batch_size>
./giganorm --local --keep-intermediate /path/to/example_srr_list.txt /path/to/example_coords.bed /path/to/STARindex /path/to/output_dir TPMThis will run locally for each provided fastq file in serial order, if --local is not defined, SLURM jobs will be submitted for respective samples
./giganorm --local --keep-intermediate --input_fastq /path/to/example_fastq_list.txt /path/to/example_coords.bed /path/to/STARindex /path/to/output_dir TPMPer-sample metric files:
output_dir/TPM/SRR12345.TPM.txtCombined matrix:
output_dir/TPM/combined_TPM_matrix.tsvLogs:
output_dir/logs/ /^-----^\
V o o V βWoof! I sniffed out your counts.β
| Y | Norman the GigaDog approves this pipeline.
\ Q / All data normalized. Tail is wagging.
/ - \
| \
| \ "If your files are too big,
|| (___\ just giganorm it!"
- Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu, fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884βi890. https://doi.org/10.1093/bioinformatics/bty560
- Andrews, S. (2010). FastQC: A quality control tool for high-throughput sequence data. https://github.com/s-andrews/FastQC
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25. PMID: 23104886; PMCID: PMC3530905.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PMID: 19505943; PMCID: PMC2723002.
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PMID: 20110278; PMCID: PMC2832824.
- NCBI SRA Toolkit Development Team (2024). SRA Toolkit. National Center for Biotechnology Information. GitHub Repository. https://github.com/ncbi/sra-tools
- bp2008. (2024). WebProxy. GitHub. https://github.com/bp2008/WebProxy