- Introduction
- Installation
- Usage
- Inputs
- Output Structure
- Managing storage with watch_and_transfer.sh
- Example SLURM wrapper: run.sh
nf-sra_screen is a Nextflow pipeline for taxon‑focused screening and assembly of public SRA runs and/or local FASTQ files, followed by taxonomic annotation and binning.
Given:
- a list of SRA accessions and/or a table of local FASTQ files,
- an NCBI taxonomy snapshot and NCBI <-> GTDB mapping tables,
- a UniProt DIAMOND database,
- (optional) Sandpiper and SingleM marker‑gene databases for pre‑screening,
the pipeline will:
- Discover and filter suitable SRR runs from SRA metadata (short‑read, ONT, PacBio CLR/HiFi).
- (Optional) Pre‑screen samples using Sandpiper and/or SingleM against a GTDB‑derived phylum list.
- Assemble reads with:
  - metaSPAdes for short reads
  - metaFlye for ONT and PacBio CLR
  - myloasm for PacBio HiFi
- Annotate contigs with DIAMOND against UniProt and summarise with BlobToolKit.
- (Optional) Extract contigs matching user‑specified taxa into per‑taxon FASTA and ID lists.
- (Optional) Run multiple metagenome binners (MetaBAT2, ComeBin, SemiBin, Rosella) and reconcile them with DAS Tool.
- Collate a per‑sample `summary.tsv` with counts and failure/success notes, and post‑annotate it using scheduler info from the Nextflow `trace.tsv`.
You can use the pipeline in four modes (a minimal assembly‑only example follows this list):
- Assembly only: just give SRA/FASTQ + `--taxdump` + `--uniprot_db`.
- Assembly + binning (`--binning`): just give SRA/FASTQ + `--taxdump` + `--uniprot_db`.
- Assembly + taxon screening (`--taxa`): additionally provide the GTDB mapping and SingleM/Sandpiper databases.
- Assembly + taxon screening + binning (`--taxa` & `--binning`): additionally provide the GTDB mapping and SingleM/Sandpiper databases.
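For example, a minimal assembly‑only invocation might look like this (the paths are placeholders; pick the profile that matches your setup):

```bash
# Assembly-only mode: no --taxa, no --binning (placeholder paths)
nextflow run asuq/nf-sra_screen \
    -profile local \
    --sra sra.csv \
    --taxdump /path/to/ncbi_taxdump_dir \
    --uniprot_db /path/to/uniprot.dmnd \
    --outdir nf-sra_screen_results
```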
The top‑level orchestration is split into four named workflows in `main.nf`:
- `PRE_SCREENING` – SRA metadata -> SRR selection -> optional Sandpiper/SingleM screening.
- `ASSEMBLY` – assembly, DIAMOND, BlobToolKit, optional taxon extraction.
- `BINNING` – MetaBAT2, ComeBin, SemiBin, Rosella, DAS Tool, and binning note aggregation.
- `SUMMARY` – merges all success and failure notes into the final global `summary.tsv`.
- Nextflow == 25.04.8
- Plugins: `nf-boost@~0.6.0` (configured in `nextflow.config`)
- Container back‑end:
  - Docker, or
  - Singularity / Apptainer
- For the helper watcher scripts (`watch_and_transfer.sh` / `run.sh`): a Slurm cluster with:
  - `sbatch`, `sacct`, `rsync`, `flock`
  - a data‑copy partition (the example uses `-p datacp`)
- `--taxdump` – NCBI taxdump dir (`nodes.dmp`, `names.dmp`, `taxidlineage.dmp` or classical taxdump)
- `--uniprot_db` – UniProt DIAMOND database (`.dmnd`); see the BlobToolKit documentation for how to build this (a rough sketch follows this list)
- `--gtdb_ncbi_map` (with `--taxa`) – dir with the NCBI -> GTDB crosswalk: `ncbi_vs_gtdb_bacteria.xlsx`, `ncbi_vs_gtdb_archaea.xlsx`, `gtdb_r226.dic` from the GTDB download
- `--sandpiper_db` (with `--taxa`) – Sandpiper db with `sandpiper_sra.txt`, `sandpiper1.0.0.condensed.tsv`
- `--singlem_db` (with `--taxa`) – SingleM metapackage (e.g. `S5.4.0.GTDB_r226.metapackage_20250331.smpkg.zb`)

All tools used by the pipeline are provided via containers defined in nextflow.config.
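For orientation only (the BlobToolKit documentation is the authoritative recipe), a taxonomy‑aware DIAMOND database is typically built along these lines; the FASTA and taxid‑map file names below are placeholders:

```bash
# Rough sketch only — follow the BlobToolKit docs for the exact UniProt database recipe.
# reference_proteomes.fasta.gz and reference_proteomes.taxid_map are placeholder names.
diamond makedb \
    --in reference_proteomes.fasta.gz \
    --db uniprot \
    --taxonmap reference_proteomes.taxid_map \
    --taxonnodes /path/to/ncbi_taxdump_dir/nodes.dmp \
    --taxonnames /path/to/ncbi_taxdump_dir/names.dmp   # produces uniprot.dmnd
```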
The pipeline can ingest SRA accessions and/or local FASTQ files in the same run. Internally these are merged before assembly.
Prepare a CSV with a single column sra, each row holding an SRA accession (project, study, or run):
sra.csv
sra
PRJNAXXXXXX
SRPXXXXXX
ERRXXXXXX
Each row can be a project, study, or run accession. The pipeline will query metadata and expand each project/study into multiple SRR runs internally.
The metadata files are written under:
<outdir>/metadata/<sra>
Prepare a TSV describing local FASTQ files.
fastq.tsv
sample read_type reads
A98 hifi a98.fastq.gz
B27 short read_1.fastq.gz,read_2.fastq.gz
C03 nanopore c03_pass.fastq.gz
D48 pacbio d48.fastq.gz

- `sample`: logical sample identifier.
- `read_type`: read type used for assembler selection. Use one of the labels below:
  - `short`: short paired‑end reads (Illumina, BGISEQ, DNBSEQ) (metaSPAdes)
  - `nanopore`: Nanopore reads (metaFlye)
  - `pacbio`: PacBio CLR reads (metaFlye)
  - `hifi`: PacBio HiFi reads (myloasm)
- `reads`: comma‑separated list of FASTQ paths (absolute or relative); at least one file per row is required. Two or more files are treated as paired‑end for SingleM/metaSPAdes, one as single‑end.
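If your short‑read files happen to follow a `*_1.fastq.gz` / `*_2.fastq.gz` naming convention (an assumption for this sketch), a table like the one above can be generated with a small shell loop:

```bash
# Convenience sketch: build fastq.tsv for paired short reads named *_1.fastq.gz / *_2.fastq.gz
printf 'sample\tread_type\treads\n' > fastq.tsv
for r1 in /path/to/reads/*_1.fastq.gz; do
    r2="${r1%_1.fastq.gz}_2.fastq.gz"          # matching mate file
    sample="$(basename "$r1" _1.fastq.gz)"     # sample name from the file prefix
    printf '%s\tshort\t%s,%s\n' "$sample" "$r1" "$r2" >> fastq.tsv
done
```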
In FASTQ + screening mode (`--fastq_tsv` + `--taxa`), each sample is treated as:
sra = sample
srr = sample
platform = UNKNOWN
model = read_type
strategy = UNKNOWN
assembler = read_type

SingleM is run in place of Sandpiper for these samples.
In FASTQ + no‑screening mode (`--fastq_tsv`), reads go straight into assembly and optionally binning.
Provide a CSV of target taxa if you want taxon‑specific screening and contig extraction:
taxa.csv
rank,taxa
phylum,Bacillota
class,Gammaproteobacteria
order,o__Chloroflexales
genus,g__Escherichia
Allowed ranks (case-insensitive)
realm,domain,superkingdom,kingdom,phylum,class,order,family,genus,species
Important:
- If you do not supply `--taxa`, the pipeline skips SingleM/Sandpiper and taxon‑specific extraction.
- If you supply a taxon in GTDB style, the pipeline runs SingleM/Sandpiper but skips taxon‑specific extraction.
nextflow run asuq/nf-sra_screen \
-profile <docker/singularity/local/slurm/...> \
--sra sra.csv \
--fastq_tsv fastq.tsv \
--taxdump /path/to/ncbi_taxdump_dir \
--uniprot_db /path/to/uniprot.dmnd \
--taxa taxa.csv \
--gtdb_ncbi_map /path/to/ncbi_vs_gtdb_xlsx_dir \
--sandpiper_db /path/to/sandpiper_db_dir \
--singlem_db /path/to/singlem_metapackage \
--outdir nf-sra_screen_results

- `-profile` – Nextflow profile (see below)
- `--sra` – CSV with column `sra` listing project accessions
- `--fastq_tsv` – TSV with columns (`sample`, `read_type`, `reads`) listing sample reads
- `--taxdump` – directory containing NCBI taxdump files; `jsonify_taxdump.py` will create `taxdump.json`
- `--uniprot_db` – UniProt DIAMOND database (`.dmnd`) (follow the BlobToolKit tutorial)
- `--taxa` – (optional) CSV with `rank,taxa` (NCBI or GTDB names); use it if you want taxonomy screening
- `--gtdb_ncbi_map` – (optional) directory with `ncbi_vs_gtdb_bacteria.xlsx` and `ncbi_vs_gtdb_archaea.xlsx`; for taxonomy screening
- `--sandpiper_db` – (optional) directory with Sandpiper summary tables; for taxonomy screening
- `--singlem_db` – (optional) SingleM metapackage (e.g. `S5.4.0.GTDB_r226.metapackage_20250331.smpkg.zb`); for taxonomy screening
- `--outdir` – output directory (default: `./output`)
- `--max_retries` – maximum number of retries per process (default: 3)
- `--help` – print the pipeline help message and exit
Available profiles:

- `local`
  - Executor: `local`, `docker.enabled = true`
  - Small queue size and moderate resources (`max_cpus=8`, `max_memory=16.GB`).
- `slurm`
  - Executor: `slurm`, `singularity.enabled = true`
  - Large queue size (`queueSize=2000`) and increased resource caps.
- `oist`
  - Includes `conf/oist.config` for OIST Deigo HPC settings.
- `debug`
  - `docker.enabled = true`, `executor.queueSize = 1`
  - Extended `trace.fields` for debugging.
- `test`
  - For small regression tests.
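If the resource caps of a built‑in profile don't match your machine, you can layer a small custom config on top with Nextflow's `-c` option. The `max_cpus`/`max_memory` parameter names below follow the profile descriptions above, and the file name is just an example:

```bash
# Hedged example: override resource caps with a custom config layered via -c.
cat > custom.config <<'EOF'
params {
    max_cpus   = 16
    max_memory = 64.GB
}
EOF

nextflow run asuq/nf-sra_screen \
    -profile slurm \
    -c custom.config \
    --sra sra.csv \
    --taxdump /path/to/ncbi_taxdump_dir \
    --uniprot_db /path/to/uniprot.dmnd
```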
Output structure
<output>/
metadata/
<sra>/
<sra>.filtered.csv
<sra>.skipped.csv
<sra>.FAIL.note # if metadata step failed
<sra>/<srr>/
# Screening
sandpiper_report.txt
sandpiper_output.tsv
sandpiper_decision.txt
singlem_taxonomic_profile.tsv
singlem_taxonomic_profile_krona*
singlem_output.tsv
# Assembly
assembly.fasta
assembly.gfa
spades.log / flye.log / myloasm.log
fastp.html # short-read only
# BlobToolKit
blobtools.csv
blobtools*.svg
# Taxon extraction (if --taxa)
summary.csv
*.ids.csv
*.fasta
# Binning (if --binning)
binning/
metabat/
comebin/
semibin/
rosella/
dastool/
metabat.note # if failed
comebin.note # if failed
semibin.note # if failed
rosella.note # if failed
dastool.note # if failed
binning_note.txt # aggregated notes
summary.tsv # global summary across all samples
execution-reports/
timeline.html
report.html
trace.tsv

Managing storage with watch_and_transfer.sh
Long metagenomic runs can fill storage rapidly. The helper script watch_and_transfer.sh is designed to stream finished per‑sample output folders off the run directory to a longer‑term storage location and clean up safely.
Given:
- `RUN_DIR`: the Nextflow run directory (where `output/` and `summary.tsv` live)
- `DEST_DIR`: a larger storage area (e.g. object store or shared filesystem)
- `INTERVAL_MINS`: how often (in minutes) to scan for new samples
watch_and_transfer.sh will:
- Acquire an exclusive lock in `RUN_DIR/.watch_and_transfer.lock` so only one watcher instance runs per pipeline.
- Read `RUN_DIR/output/summary.tsv` and, for each `(sra, srr)`:
  - Skip samples already listed in `RUN_DIR/.processed_summary.tsv`.
  - Skip samples that already have a pending transfer in `RUN_DIR/.pending_copy_jobs.tsv`.
- Interpret the `note` column:
  - If `note` starts with `did not match the criteria`: the run was filtered at the metadata stage; it is recorded as processed without any transfer.
  - If `note` is non‑empty and `output/$sra/$srr` does not exist, the run is considered failed/filtered with no outputs and is marked processed.
  - Otherwise, the run is treated as a completed sample with outputs.
- For each completed sample with an output directory (a hedged sketch of the submission and status check follows this list):
  - Submits a Slurm job via `sbatch` on partition `datacp` that runs
    `rsync -a "${RUN_DIR}/output/$sra/$srr"/ "${DEST_DIR}/$sra/$srr"/`
    followed by
    `rm -rf ${RUN_DIR}/output/$sra/$srr`
  - Attempts to `rmdir` the now‑empty `${RUN_DIR}/output/$sra` directory.
  - Records `(sra, srr, job_id)` in `.pending_copy_jobs.tsv`.
- On each cycle, `check_pending_jobs`:
  - Queries Slurm with `sacct` for all pending job IDs.
  - For jobs that finished with `State=COMPLETED` and an ExitCode starting with 0, logs success.
  - Deletes the corresponding `slurm-<jobid>.out` log.
  - Appends `(sra, srr)` to `.processed_summary.tsv`.
  - For jobs in transient states (PENDING/RUNNING/etc.), keeps them pending.
  - For failed/cancelled/timed‑out jobs, removes them from pending; the sample will be re‑submitted in a later cycle.
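A hedged sketch of how the transfer submission and the `sacct` status check might look (the core mechanics only, not the script verbatim; it assumes `RUN_DIR`, `DEST_DIR`, `sra` and `srr` are already set):

```bash
# Hedged sketch — not watch_and_transfer.sh verbatim.

# Submit one copy-and-clean job on the data-copy partition and capture its job ID.
job_id=$(sbatch --parsable -p datacp --wrap "
  mkdir -p '${DEST_DIR}/${sra}' \
    && rsync -a '${RUN_DIR}/output/${sra}/${srr}/' '${DEST_DIR}/${sra}/${srr}/' \
    && rm -rf '${RUN_DIR}/output/${sra}/${srr}' \
    && rmdir --ignore-fail-on-non-empty '${RUN_DIR}/output/${sra}'
")
printf '%s\t%s\t%s\n' "$sra" "$srr" "$job_id" >> "${RUN_DIR}/.pending_copy_jobs.tsv"

# Later, ask sacct whether the copy job completed cleanly before marking the sample processed.
state_exit=$(sacct -j "$job_id" --format=State,ExitCode --parsable2 --noheader | head -n 1)
if [[ "$state_exit" == "COMPLETED|0"* ]]; then
    printf '%s\t%s\n' "$sra" "$srr" >> "${RUN_DIR}/.processed_summary.tsv"
fi
```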
The script runs indefinitely in a loop:
while :; do
    check_pending_jobs
    move_output_to_storage
    sleep "${INTERVAL_MINS}m"   # GNU sleep accepts the 'm' (minutes) suffix
done

It requires:
- a Slurm environment with:
  - `sbatch`
  - `sacct`
- a partition suitable for data transfer (the script uses `-p datacp`; change if needed)
The pipeline must be writing `summary.tsv` to `RUN_DIR/output/summary.tsv`, which is the default when you use `--outdir output` and run from `RUN_DIR`.
From a login node (ideally in a tmux/screen session):
bin/watch_and_transfer.sh RUN_DIR DEST_DIR INTERVAL_MINS

Example:
bin/watch_and_transfer.sh \
/fast/youruser/project_X/run1 \
/long/yourgroup/project_X/archive \
10

This will:
- Check every 10 minutes for new rows in `output/summary.tsv`.
- Start Slurm copy jobs as samples finish.
- Free space under `RUN_DIR/output` once a copy is verified as successful.
- Keep a small amount of state in:
  - `RUN_DIR/.processed_summary.tsv`
  - `RUN_DIR/.pending_copy_jobs.tsv`
  - `RUN_DIR/.watch_and_transfer.lock`
Example SLURM wrapper: run.sh
The repository includes an example wrapper run.sh showing how to run the pipeline and watcher together on a Slurm cluster.
What run.sh does
1. Defines user‑specific paths:
RUN_DIR='/fast/.../nf-sra_screen_run'
DEST_DIR='/long/.../nf-sra_screen_archive'
INTERVAL_MINS=10
NF_SRA_SCREEN='/path/to/nf-sra_screen' # clone of this repo

2. Installs a `trap` so that when the script exits (successfully or not), it (see the sketch after this list):
   - Attempts to stop the background watcher process cleanly.
   - Preserves the original Nextflow exit status.
3. Changes into `RUN_DIR` so that:
   - `.nextflow.log`, `work/`, and `output/` live there.
   - `watch_and_transfer.sh` can find `output/summary.tsv` at the expected location.
4. Starts the watcher in the background:
"${NF_SRA_SCREEN}/bin/watch_and_transfer.sh" \
"${RUN_DIR}" \
"${DEST_DIR}" \
"${INTERVAL}" \
> watch_and_transfer.log 2>&1 &and records its PID in watch_and_transfer.pid.
5. Runs the Nextflow pipeline (with your chosen profile and parameters):
nextflow run asuq/nf-sra_screen \
-profile <docker/singularity/local/slurm/...> \
--sra sra.csv \
--fastq_tsv fastq.tsv \
--taxdump /path/to/ncbi_taxdump_dir \
--uniprot_db /path/to/uniprot.dmnd \
--taxa taxa.csv \
--binning \
--gtdb_ncbi_map /path/to/ncbi_vs_gtdb_xlsx_dir \
--sandpiper_db /path/to/sandpiper_db_dir \
--singlem_db /path/to/singlem_metapackage \
--outdir nf-sra_screen_results \
-resume

6. Exits with the same status code as the Nextflow run, triggering the `EXIT` trap, which in turn stops the watcher.
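For orientation, a minimal sketch of how such an exit trap might be written (not `run.sh` verbatim; the PID file name follows the description in step 4):

```bash
# Minimal sketch of the exit trap described in step 2 (not run.sh verbatim).
cleanup() {
    local status=$?                          # preserve the Nextflow exit status
    if [[ -f watch_and_transfer.pid ]]; then
        kill "$(cat watch_and_transfer.pid)" 2>/dev/null || true
    fi
    exit "$status"
}
trap cleanup EXIT
```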
To reuse this pattern:
- Copy `run.sh` somewhere in your project.
- Edit:
  - `RUN_DIR`: a scratch or fast filesystem path for the actual run.
  - `DEST_DIR`: slower / archival filesystem for final results.
  - `NF_SRA_SCREEN`: path to your clone of this repository.
  - The Nextflow command at the bottom (profile name, database paths, etc.).
  - The Slurm partition used for data copy in `watch_and_transfer.sh` (`-p datacp`) if your site uses a different name.
- Submit `run.sh` itself as a Slurm job (see the example below) or run it on a login node with `tmux` (depending on your site policy). All heavy work is still done by Nextflow processes and the per‑sample transfer jobs.
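If you submit it as a job, something along these lines should work; the partition, walltime, and resource values are placeholders to adapt to your site:

```bash
# Placeholder values — adjust partition, walltime and resources to your site.
sbatch \
    --job-name=nf-sra_screen \
    --partition=compute \
    --time=7-00:00:00 \
    --cpus-per-task=2 \
    --mem=8G \
    run.sh
```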
Author / maintainer: Akito Shima (ASUQ), akito-shima[at]oist.jp
- iSeq
- SRA toolkit
- Sandpiper
- SingleM
- DIAMOND
- BlobToolKit
- fastp
- metaSPAdes / SPAdes
- Flye
- myloasm
- bowtie2
- minimap2
- samtools
- MetaBAT2
- ComeBin
- SemiBin
- Rosella
- DAS Tool
