This Nextflow DSL2 pipeline performs RNA-seq quality control (FastQC, MultiQC), genome alignment with HISAT2, genome-guided de novo transcriptome assembly with Trinity, coding sequence prediction with TransDecoder, completeness assessment with BUSCO, protein similarity search with BLASTx, sample-level quantification with Salmon, and differential expression with DESeq2 (via tximport). featureCounts quantification is optional and runs only when a gene annotation is supplied.
- FastQC per-sample QC and MultiQC aggregation
- HISAT2 genome index and paired-end alignment, sorted/indexed BAMs
- Trinity genome-guided assembly across all BAMs
- TransDecoder ORF prediction
- BUSCO transcriptome completeness
- BLASTx annotation against a reference protein FASTA
- Salmon quantification against assembled transcripts
- DESeq2 differential expression using tximport
- Optional featureCounts quantification (requires GTF/GFF)
- Nextflow 22.10+ (`nextflow -version`)
- Either Conda/Mamba (recommended) or Docker/Singularity
CSV with header and four columns: sample,condition,fq1,fq2.
- sample: short sample identifier (no spaces)
- condition: group label (e.g., `M` for male, `F` for female)
- fq1: path to read 1 FASTQ (gzip accepted)
- fq2: path to read 2 FASTQ (gzip accepted)
Example:
sample,condition,fq1,fq2
F1,F,data/IT_F1_1.fq.gz,data/IT_F1_2.fq.gz
F2,F,data/IT_F2_1.fq.gz,data/IT_F2_2.fq.gz
F3,F,data/IT_F3_1.fq.gz,data/IT_F3_2.fq.gz
M1,M,data/IT_M1_1.fq.gz,data/IT_M1_2.fq.gz
M2,M,data/IT_M2_1.fq.gz,data/IT_M2_2.fq.gz
M3,M,data/IT_M3_1.fq.gz,data/IT_M3_2.fq.gz
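A malformed samplesheet is easier to catch before launching the pipeline than after. The function below is a minimal pre-flight sketch and is not part of the pipeline itself: the name `check_samplesheet` is hypothetical, and it assumes a plain comma-separated file with Unix line endings and no quoted fields.

```shell
# Hypothetical pre-flight check (not shipped with the pipeline).
# Verifies the header, the column count, the sample IDs, and the FASTQ paths.
check_samplesheet() {
  sheet="$1"
  rc=0
  {
    # The first line must be exactly the documented header.
    IFS= read -r header
    if [ "$header" != "sample,condition,fq1,fq2" ]; then
      echo "ERROR: unexpected header: $header" >&2
      return 1
    fi
    # Remaining lines: one sample per row. The "|| [ -n ... ]" clause also
    # processes a final line that lacks a trailing newline.
    while IFS=, read -r sample condition fq1 fq2 extra || [ -n "$sample" ]; do
      # Skip completely blank lines.
      [ -z "$sample$condition$fq1$fq2$extra" ] && continue
      if [ -z "$sample" ] || [ -z "$condition" ] || [ -z "$fq1" ] || [ -z "$fq2" ] || [ -n "$extra" ]; then
        echo "ERROR: row needs exactly four non-empty columns (sample: '$sample')" >&2
        rc=1
        continue
      fi
      case "$sample" in
        *" "*) echo "ERROR: sample ID contains a space: '$sample'" >&2; rc=1 ;;
      esac
      for fq in "$fq1" "$fq2"; do
        if [ ! -f "$fq" ]; then
          echo "ERROR: FASTQ not found: $fq" >&2
          rc=1
        fi
      done
    done
  } < "$sheet"
  return "$rc"
}
```

After sourcing the function, `check_samplesheet data/samplesheet.csv` returns a non-zero status and prints one message per problem if anything is wrong.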
Run with Conda:
nextflow run main.nf -profile standard,conda \
--genome data/genome.fasta \
--samplesheet data/samplesheet.csv \
--ref_proteins data/ref_proteins.fasta \
--busco_db insecta_odb10 \
--outdir results \
--trinity_prefix gg_trinity \
--max_intron 10000 \
--de_case_label M --de_control_label F

Optional: featureCounts if you have a gene annotation:
nextflow run main.nf -profile standard,conda \
--genome data/genome.fasta \
--samplesheet data/samplesheet.csv \
--ref_proteins data/ref_proteins.fasta \
--busco_db insecta_odb10 \
--featurecounts_gtf data/annotation.gtf

- `results/qc/fastqc/`: FastQC reports per sample
- `results/qc/`: MultiQC report
- `results/genome_index/`: HISAT2 genome index
- `results/*.bam` and `*.bam.bai`: Sorted and indexed BAMs per sample
- `results/gg_trinity_assembly/Trinity-GG.fasta`: Assembled transcripts
- `results/gg_trinity_assembly/Trinity-GG.fasta.transdecoder.*`: TransDecoder outputs
- `results/busco/`: BUSCO results (`busco_out/`)
- `results/blastdb/`: BLAST protein database
- `results/blastx.outfmt6`: BLASTx tabular hits (outfmt 6)
- `results/salmon_index/`: Salmon index of assembled transcripts
- `results/salmon_quant/`: Per-sample `quant.sf`
- `results/de/`: Differential expression: `deseq2_results.tsv`, `normalized_counts.tsv`, `plots/`
- `results/counts.txt`: featureCounts matrix (if `--featurecounts_gtf` is provided)
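Once a run finishes, the listed locations can be checked in one pass. The sketch below is a hypothetical helper, not a pipeline feature. It assumes the assembly directory is named `<trinity_prefix>_assembly`, as the `gg_trinity_assembly` path suggests, and it checks only a subset of the outputs.

```shell
# Hypothetical post-run check; the paths mirror the documented output list.
check_outputs() {
  outdir="${1:-results}"
  prefix="${2:-gg_trinity}"   # assumed to match the value of --trinity_prefix
  missing=0
  for path in \
    "$outdir/qc/fastqc" \
    "$outdir/genome_index" \
    "$outdir/${prefix}_assembly/Trinity-GG.fasta" \
    "$outdir/busco" \
    "$outdir/blastx.outfmt6" \
    "$outdir/salmon_quant" \
    "$outdir/de/deseq2_results.tsv" \
    "$outdir/de/normalized_counts.tsv"
  do
    if [ ! -e "$path" ]; then
      echo "MISSING: $path" >&2
      missing=$((missing + 1))
    fi
  done
  echo "$missing expected output(s) missing"
  # Succeed only when every listed path exists.
  [ "$missing" -eq 0 ]
}
```

Calling `check_outputs results gg_trinity` reports each absent path on stderr and returns a non-zero status if any is missing.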
- `--genome` (string): Path to genome FASTA
- `--samplesheet` (string): CSV with `sample,condition,fq1,fq2`
- `--ref_proteins` (string): Path to protein FASTA for BLASTx DB
- `--busco_db` (string): BUSCO lineage name (e.g. `insecta_odb10`) or local lineage path
- `--outdir` (string): Output directory (default: `results`)
- `--trinity_prefix` (string): Prefix for assembly output directory name
- `--max_intron` (int): Trinity `--genome_guided_max_intron` (default: 10000)
- `--de_case_label` (string): Case condition label (default: `M`)
- `--de_control_label` (string): Control condition label (default: `F`)
- `--featurecounts_gtf` (string|null): GTF/GFF path for featureCounts
- `--max_cpus` (int): Max CPUs per process (default: 16)
- `--max_memory` (string): Max memory per process (default: `100.GB`)
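The same parameters can also be kept in a file and passed through Nextflow's `-params-file` option, which keeps long command lines reproducible. This is a minimal sketch: the file name `params.yaml` is arbitrary, the keys are the flag names without the leading dashes, and the values are the ones from the Conda example.

```shell
# Write the example run's parameters to a YAML file.
# Keys correspond to the pipeline's --flags without the leading dashes.
cat > params.yaml <<'EOF'
genome: data/genome.fasta
samplesheet: data/samplesheet.csv
ref_proteins: data/ref_proteins.fasta
busco_db: insecta_odb10
outdir: results
trinity_prefix: gg_trinity
max_intron: 10000
de_case_label: M
de_control_label: F
EOF
```

The run then becomes `nextflow run main.nf -profile standard,conda -params-file params.yaml`. Parameters given on the command line still take precedence over values in the file.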
- DE is performed at transcript-level using Salmon + tximport + DESeq2; adjust `--de_case_label` and `--de_control_label` to match your groups.
- BUSCO may download lineage data on first run; set `--busco_db` to a local path to avoid downloads.
- featureCounts requires a valid annotation (GTF/GFF). Counting against a FASTA is not supported.
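Because the case and control labels must match values in the `condition` column, a quick count per label can confirm them before a run. The helpers below are a hypothetical sketch, not part of the pipeline; they assume a simple comma-separated samplesheet with a header row and no quoted fields.

```shell
# Hypothetical helper: count samplesheet rows whose condition equals a label.
count_condition() {
  # $1 = samplesheet CSV, $2 = condition label
  awk -F, -v label="$2" 'NR > 1 && $2 == label { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: fail if the case or control label has no samples.
check_de_labels() {
  # $1 = samplesheet CSV, $2 = case label, $3 = control label
  for label in "$2" "$3"; do
    n="$(count_condition "$1" "$label")"
    echo "$label: $n sample(s)"
    if [ "$n" -lt 1 ]; then
      echo "ERROR: no samples with condition '$label' in $1" >&2
      return 1
    fi
  done
}
```

For the example samplesheet, `check_de_labels data/samplesheet.csv M F` would report three samples for each label.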