
nf-sra_screen


Introduction

nf-sra_screen is a Nextflow pipeline for taxon‑focused screening and assembly of public SRA runs and/or local FASTQ files, followed by taxonomic annotation and binning.

nf-sra_screen diagram

Given:

  • a list of SRA accessions and/or a table of local FASTQ files,
  • an NCBI taxonomy snapshot and NCBI <-> GTDB mapping tables,
  • a UniProt DIAMOND database,
  • (Optional) Sandpiper and SingleM marker‑gene databases for pre-screening,

the pipeline will:

  1. Discover and filter suitable SRR runs from SRA metadata (short‑read, ONT, PacBio CLR/HiFi).
  2. (Optional) Pre‑screen samples using Sandpiper and/or SingleM against a GTDB‑derived phylum list.
  3. Assemble reads with:
    • metaSPAdes for short reads
    • metaFlye for ONT and PacBio CLR
    • myloasm for PacBio HiFi
  4. Annotate contigs with DIAMOND against UniProt and summarise with BlobToolKit.
  5. (Optional) Extract contigs matching user‑specified taxa into per‑taxon FASTA and ID lists.
  6. (Optional) Run multiple metagenome binners (MetaBAT2, ComeBin, SemiBin, Rosella) and reconcile them with DAS Tool.
  7. Collate a per‑sample summary.tsv with counts and failure/success notes, and post‑annotate it using scheduler info from the Nextflow trace.tsv.

You can use the pipeline in four modes:

  • Assembly only: just give SRA/FASTQ + --taxdump + --uniprot_db.
  • Assembly + binning (--binning): the same inputs as assembly only, plus the --binning flag.
  • Assembly + taxon screening (--taxa): additionally provide the GTDB mapping and the SingleM/Sandpiper databases.
  • Assembly + taxon screening + binning (--taxa & --binning): additionally provide the GTDB mapping and the SingleM/Sandpiper databases, plus --binning.
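
For example, a minimal assembly-only invocation might look like this (a sketch: paths are placeholders and the parameter names follow the full example later in this README; add --binning and/or --taxa with the extra databases to switch modes):

nextflow run asuq/nf-sra_screen \
  -profile docker \
  --sra sra.csv \
  --taxdump /path/to/ncbi_taxdump_dir \
  --uniprot_db /path/to/uniprot.dmnd \
  --outdir nf-sra_screen_results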

The top‑level orchestration is split into four named workflows in main.nf:

  • PRE_SCREENING – SRA metadata -> SRR selection -> optional Sandpiper/SingleM screening.
  • ASSEMBLY – Assembly, DIAMOND, BlobToolKit, optional taxon extraction.
  • BINNING – MetaBAT2, ComeBin, SemiBin, Rosella, DAS Tool, and binning note aggregation.
  • SUMMARY – merges all success and failure notes into the final global summary.tsv.

Installation

Requirements

  • Nextflow: == 25.04.8
  • Plugins:
    • nf-boost@~0.6.0 (configured in nextflow.config)
  • Container back‑end:
    • Docker, or
    • Singularity / Apptainer
  • For the helper watcher scripts (watch_and_transfer.sh / run.sh): a Slurm cluster with:
    • sbatch, sacct, rsync, flock
    • a data‑copy partition (the example uses -p datacp)
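
As a quick sanity check before a long run, you can verify that the commands the pipeline and the watcher scripts rely on are available (a minimal sketch; adjust the list to the tools you actually use):

for cmd in nextflow sbatch sacct rsync flock; do
  command -v "$cmd" >/dev/null || echo "missing: $cmd"
done
nextflow -version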

Database requirements

  • --taxdump NCBI taxdump dir (nodes.dmp, names.dmp, taxidlineage.dmp or classical taxdump)
  • --uniprot_db UniProt DIAMOND database (.dmnd) (See the BlobToolKit documentation for how to build this)
  • --gtdb_ncbi_map (with --taxa) Dir with NCBI -> GTDB crosswalk: ncbi_vs_gtdb_bacteria.xlsx, ncbi_vs_gtdb_archaea.xlsx, gtdb_r226.dic from GTDB download
  • --sandpiper_db (with --taxa) Sandpiper db with sandpiper_sra.txt, sandpiper1.0.0.condensed.tsv
  • --singlem_db (with --taxa) SingleM metapackage (e.g. S5.4.0.GTDB_r226.metapackage_20250331.smpkg.zb)

All tools used by the pipeline are provided via containers defined in nextflow.config.
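
One possible way to prepare the taxdump and the DIAMOND database (a sketch: the FASTA and accession-to-taxid mapping file names below are placeholders, and the BlobToolKit documentation remains the authoritative guide for building a UniProt database):

# NCBI taxdump (new-style taxdump, includes taxidlineage.dmp)
mkdir -p ncbi_taxdump_dir && cd ncbi_taxdump_dir
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
tar xzf new_taxdump.tar.gz
cd ..

# DIAMOND database with taxonomy attached (input file names are placeholders)
diamond makedb \
  --in uniprot_reference_proteomes.fasta \
  --db uniprot \
  --taxonmap prot.accession2taxid.gz \
  --taxonnodes ncbi_taxdump_dir/nodes.dmp \
  --taxonnames ncbi_taxdump_dir/names.dmp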

Usage

The pipeline can ingest SRA accessions and/or local FASTQ files in the same run. Internally these are merged before assembly.

1. Input: SRA samplesheet (--sra)

Prepare a CSV with a single column sra, where each row holds an SRA project, study, or run accession:

sra.csv

sra
PRJNAXXXXXX
SRPXXXXXX
ERRXXXXXX

Each row can be a project, study, or run accession. The pipeline will query metadata and expand each project into multiple SRR runs internally.

The metadata files are written under:

<outdir>/metadata/<sra>

2. Input: FASTQ list (--fastq_tsv)

Prepare a TSV describing local FASTQ files.

fastq.tsv

sample	read_type	reads
A98	hifi	a98.fastq.gz
B27	short	read_1.fastq.gz,read_2.fastq.gz
C03	nanopore	c03_pass.fastq.gz
D48	pacbio	d48.fastq.gz

  • sample: logical sample identifier.
  • read_type: read type used for assembler selection; use one of the labels below:
    • short: short paired‑end reads (Illumina, BGISEQ, DNBSEQ) (metaSPAdes),
    • nanopore: Nanopore reads (metaFlye),
    • pacbio: PacBio CLR reads (metaFlye),
    • hifi: PacBio HiFi reads (myloasm).
  • reads: comma‑separated list of FASTQ paths (absolute or relative); at least one file per row is required. Two or more files are treated as paired‑end for SingleM/metaSPAdes, a single file as single‑end.

In FASTQ + screening mode (--fastq_tsv + --taxa), each sample is treated as:

sra = sample
srr = sample
platform = UNKNOWN
model = read_type
strategy = UNKNOWN
assembler = read_type

SingleM is run in place of Sandpiper for these samples.

In FASTQ + no‑screening mode (--fastq_tsv), reads go straight into assembly and optionally binning.
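
If you have many local samples, a small shell loop can draft the TSV for you (a sketch assuming paired-end short reads named <sample>_1.fastq.gz / <sample>_2.fastq.gz; adapt the glob and read_type to your data):

printf 'sample\tread_type\treads\n' > fastq.tsv
for r1 in /path/to/reads/*_1.fastq.gz; do
  r2=${r1%_1.fastq.gz}_2.fastq.gz
  sample=$(basename "$r1" _1.fastq.gz)
  printf '%s\tshort\t%s,%s\n' "$sample" "$r1" "$r2" >> fastq.tsv
done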

3. Input: taxa list (--taxa)

Provide a CSV of target taxa if you want taxon‑specific screening and contig extraction:

taxa.csv

rank,taxa
phylum,Bacillota
class,Gammaproteobacteria
order,o__Chloroflexales
genus,g__Escherichia

Allowed ranks (case-insensitive)

realm,domain,superkingdom,kingdom,phylum,class,order,family,genus,species

Important

  • If you do not supply --taxa, the pipeline skips SingleM/Sandpiper and taxon‑specific extraction.
  • If you supply taxa in GTDB style, the pipeline runs SingleM/Sandpiper but skips taxon‑specific extraction.

Full example command

nextflow run asuq/nf-sra_screen \
  -profile <docker/singularity/local/slurm/...> \
  --sra sra.csv \
  --fastq_tsv fastq.tsv \
  --taxdump /path/to/ncbi_taxdump_dir \
  --uniprot_db /path/to/uniprot.dmnd \
  --taxa taxa.csv \
  --gtdb_ncbi_map /path/to/ncbi_vs_gtdb_xlsx_dir \
  --sandpiper_db /path/to/sandpiper_db_dir \
  --singlem_db /path/to/singlem_metapackage \
  --outdir nf-sra_screen_results

Key parameters

  • -profile Nextflow profile (see Profiles below)
  • --sra CSV with column sra listing project accessions
  • --fastq_tsv TSV with columns (sample,read_type,reads) listing sample reads
  • --taxdump Directory containing NCBI taxdump files; jsonify_taxdump.py will create taxdump.json
  • --uniprot_db UniProt DIAMOND database (.dmnd) (see the BlobToolKit documentation)
  • --taxa (Optional) CSV with rank,taxa (NCBI or GTDB names). Use it if you want taxonomy screening
  • --gtdb_ncbi_map (Optional) Directory with ncbi_vs_gtdb_bacteria.xlsx and ncbi_vs_gtdb_archaea.xlsx. For taxonomy screening
  • --sandpiper_db (Optional) Directory with Sandpiper summary tables. For taxonomy screening
  • --singlem_db (Optional) SingleM metapackage (e.g. S5.4.0.GTDB_r226.metapackage_20250331.smpkg.zb) For taxonomy screening
  • --outdir Output directory (default: ./output)
  • --max_retries Maximum number of retries per process (default: 3)
  • --help Print the pipeline help message and exit.

Profiles

  • local
    • Executor: local
    • docker.enabled = true
    • Small queue size and moderate resources (max_cpus=8, max_memory=16.GB).
  • slurm
    • Executor: slurm
    • singularity.enabled = true
    • Large queue size (queueSize=2000) and increased resource caps.
  • oist
    • Includes conf/oist.config for OIST Deigo HPC settings.
  • debug
    • docker.enabled = true
    • executor.queueSize = 1
    • Extended trace.fields for debugging.
  • test
    • For small regression tests.
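
Profiles can be combined with an extra site-specific config via Nextflow's -c option if you need to override resources or container settings (my_site.config below is an assumed file name, not part of the repository):

nextflow run asuq/nf-sra_screen -profile slurm -c my_site.config --help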

Output structure

<output>/
  metadata/
    <sra>/
      <sra>.filtered.csv
      <sra>.skipped.csv
      <sra>.FAIL.note              # if metadata step failed

  <sra>/<srr>/
    # Screening
    sandpiper_report.txt
    sandpiper_output.tsv
    sandpiper_decision.txt
    singlem_taxonomic_profile.tsv
    singlem_taxonomic_profile_krona*
    singlem_output.tsv

    # Assembly
    assembly.fasta
    assembly.gfa
    spades.log / flye.log / myloasm.log
    fastp.html                     # short-read only

    # BlobToolKit
    blobtools.csv
    blobtools*.svg

    # Taxon extraction (if --taxa)
    summary.csv
    *.ids.csv
    *.fasta

    # Binning (if --binning)
    binning/
      metabat/
      comebin/
      semibin/
      rosella/
      dastool/
      metabat.note                 # if failed
      comebin.note                 # if failed
      semibin.note                 # if failed
      rosella.note                 # if failed
      dastool.note                 # if failed
      binning_note.txt             # aggregated notes

  summary.tsv                      # global summary across all samples
  execution-reports/
    timeline.html
    report.html
    trace.tsv
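
A quick way to spot failed or filtered samples after (or during) a run is to inspect the note column of the global summary (a sketch assuming summary.tsv is tab-separated with header columns named sra, srr and note, as used by the watcher script described below; check the header of your file first):

awk -F'\t' 'NR==1 { for (i=1; i<=NF; i++) col[$i]=i; next }
            $(col["note"]) != "" { print $(col["sra"]), $(col["srr"]), $(col["note"]) }' \
  output/summary.tsv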

Managing storage with watch_and_transfer.sh

Long metagenomic runs can fill storage rapidly. The helper script watch_and_transfer.sh is designed to stream finished per‑sample output folders off the run directory to a longer‑term storage location and clean up safely.

What it does

Given:

  • RUN_DIR: the Nextflow run directory (where output/ and summary.tsv live)
  • DEST_DIR: a larger storage area (e.g. object store or shared filesystem)
  • INTERVAL_MINS: how often to scan for new samples

watch_and_transfer.sh will:

  1. Acquire an exclusive lock in RUN_DIR/.watch_and_transfer.lock so only one watcher instance runs per pipeline.

  2. Read RUN_DIR/output/summary.tsv and, for each (sra,srr):

    • Skip samples already listed in RUN_DIR/.processed_summary.tsv.
    • Skip samples that already have a pending transfer in RUN_DIR/.pending_copy_jobs.tsv.
  3. Interpret the note column:

    • If the note starts with did not match the criteria, the run was filtered at the metadata stage; it is recorded as processed without any transfer.
    • If the note is non‑empty and output/$sra/$srr does not exist, the run is considered failed/filtered with no outputs and is marked processed.
    • Otherwise, the run is treated as a completed sample with outputs.

  4. For each completed sample with an output directory (see the sketch after this list):

    • Submits a Slurm job via sbatch on partition datacp that runs rsync -a "${RUN_DIR}/output/$sra/$srr"/ "${DEST_DIR}/$sra/$srr"/,
    • followed by rm -rf "${RUN_DIR}/output/$sra/$srr",
    • and an attempt to rmdir the now‑empty ${RUN_DIR}/output/$sra directory.
    • Records (sra, srr, job_id) in .pending_copy_jobs.tsv.

  5. On each cycle, check_pending_jobs:

    • Queries Slurm with sacct for all pending job IDs.
    • For jobs that finished with State=COMPLETED and an ExitCode starting with 0, logs success, deletes the corresponding slurm-<jobid>.out log, and appends (sra, srr) to .processed_summary.tsv.
    • For jobs in transient states (PENDING/RUNNING/etc.), keeps them pending.
    • For failed/cancelled/timed‑out jobs, removes them from pending; the sample will be re‑submitted in a later cycle.
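
An illustrative sketch of the kind of transfer job the watcher submits for one sample, with RUN_DIR, DEST_DIR, sra and srr already set (the exact sbatch invocation in the script may differ; datacp is the example partition used throughout this README):

sbatch -p datacp --wrap "
  rsync -a '${RUN_DIR}/output/${sra}/${srr}/' '${DEST_DIR}/${sra}/${srr}/' &&
  rm -rf '${RUN_DIR}/output/${sra}/${srr}' &&
  rmdir --ignore-fail-on-non-empty '${RUN_DIR}/output/${sra}'
"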

The script runs indefinitely in a loop:

while :; do
  check_pending_jobs
  move_output_to_storage
  sleep "$(( INTERVAL_MINS * 60 ))"   # wait INTERVAL_MINS minutes before the next scan
done

Requirements

  • Slurm environment with:
    • sbatch
    • sacct
    • a partition suitable for data transfer (the script uses -p datacp; change if needed)

The pipeline must write summary.tsv to RUN_DIR/output/summary.tsv, which is the default when you pass --outdir output and launch the pipeline from RUN_DIR.

Basic usage

From a login node (ideally in a tmux/screen session):

bin/watch_and_transfer.sh RUN_DIR DEST_DIR INTERVAL_MINS

Example:

bin/watch_and_transfer.sh \
  /fast/youruser/project_X/run1 \
  /long/yourgroup/project_X/archive \
  10

This will:

  • Check every 10 minutes for new rows in output/summary.tsv.
  • Start Slurm copy jobs as samples finish.
  • Free space under RUN_DIR/output once a copy is verified as successful.
  • Keep a small amount of state in:
    • RUN_DIR/.processed_summary.tsv
    • RUN_DIR/.pending_copy_jobs.tsv
    • RUN_DIR/.watch_and_transfer.lock

Example SLURM wrapper: run.sh

The repository includes an example wrapper run.sh showing how to run the pipeline and watcher together on a Slurm cluster.

What run.sh does

  1. Defines user‑specific paths:

RUN_DIR='/fast/.../nf-sra_screen_run'
DEST_DIR='/long/.../nf-sra_screen_archive'
INTERVAL_MINS=10
NF_SRA_SCREEN='/path/to/nf-sra_screen'  # clone of this repo
  2. Installs a trap so that when the script exits (successfully or not), it:
    • Attempts to stop the background watcher process cleanly (see the sketch after this list).
    • Preserves the original Nextflow exit status.
  3. Changes into RUN_DIR so that:
    • .nextflow.log, work/, and output/ live there.
    • watch_and_transfer.sh can find output/summary.tsv at the expected location.
  4. Starts the watcher in the background:
"${NF_SRA_SCREEN}/bin/watch_and_transfer.sh" \
  "${RUN_DIR}" \
  "${DEST_DIR}" \
  "${INTERVAL}" \
  > watch_and_transfer.log 2>&1 &

and records its PID in watch_and_transfer.pid.

  5. Runs the Nextflow pipeline (with your chosen profile and parameters):
nextflow run asuq/nf-sra_screen \
  -profile <docker/singularity/local/slurm/...> \
  --sra sra.csv \
  --fastq_tsv fastq.tsv \
  --taxdump /path/to/ncbi_taxdump_dir \
  --uniprot_db /path/to/uniprot.dmnd \
  --taxa taxa.csv \
  --binning \
  --gtdb_ncbi_map /path/to/ncbi_vs_gtdb_xlsx_dir \
  --sandpiper_db /path/to/sandpiper_db_dir \
  --singlem_db /path/to/singlem_metapackage \
  --outdir nf-sra_screen_results \
  -resume

  6. Exits with the same status code as the Nextflow run, triggering the EXIT trap, which in turn stops the watcher.
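
The EXIT trap described in step 2 might look roughly like this (an illustrative sketch, not a verbatim copy of run.sh; it assumes the watcher PID was saved to watch_and_transfer.pid as described above):

cleanup() {
  status=$?                                  # preserve the Nextflow exit status
  if [[ -f watch_and_transfer.pid ]]; then
    kill "$(cat watch_and_transfer.pid)" 2>/dev/null || true
  fi
  exit "$status"
}
trap cleanup EXIT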

Adapting run.sh for your cluster

To reuse this pattern:

  1. Copy run.sh somewhere in your project.

  2. Edit:

    • RUN_DIR: a scratch or fast filesystem path for the actual run.
    • DEST_DIR: slower / archival filesystem for final results.
    • NF_SRA_SCREEN: path to your clone of this repository.
    • The Nextflow command at the bottom (profile name, database paths, etc.).
    • The Slurm partition used for data copy in watch_and_transfer.sh (-p datacp) if your site uses a different name.

  3. Submit run.sh itself as a Slurm job or run it on a login node with tmux (depending on your site policy). All heavy work is still done by Nextflow processes and the per‑sample transfer jobs.
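
For example, submitting the wrapper as a long-running Slurm job could look like this (a sketch; job name, partition, and time limit are site-specific assumptions):

sbatch --job-name=nf-sra_screen --partition=compute --time=7-00:00:00 run.sh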

Credits

Author / maintainer: Akito Shima (ASUQ), akito-shima[at]oist.jp

Core tools (via containers)

  • iSeq
  • SRA toolkit
  • Sandpiper
  • SingleM
  • DIAMOND
  • BlobToolKit
  • fastp
  • metaSPAdes / SPAdes
  • Flye
  • myloasm
  • bowtie2
  • minimap2
  • samtools
  • MetaBAT2
  • ComeBin
  • SemiBin
  • Rosella
  • DAS Tool
