wf-assembly-snps-mod


A modified and enhanced Nextflow workflow for bacterial genome assembly-based SNP identification and phylogenetic analysis.

Note: This is a modified version of bacterial-genomics/wf-assembly-snps with additional features and optimizations.

🚀 Quick Start

# Typical usage for Burkholderia pseudomallei SNP analysis on APHL Analysis Laptop
nextflow run main.nf -profile local_workstation_rtx4070,docker \
  --input /path/to/assemblies \
  --outdir /path/to/results \
  --recombination_aware_mode true \
  --integrate_results true \
  --mash_threshold 0.028 \
  --max_cluster_size 50 \
  --merge_singletons true \
  --mash_sketch_size 50000 \
  --recombination gubbins \
  --snp_package parsnp \
  --run_gubbins

# Basic usage with input directory
nextflow run PHemarajata/wf-assembly-snps-mod \
  -profile docker \
  --input /path/to/assemblies \
  --outdir results

# Scalable mode for large datasets (>200 genomes)
nextflow run PHemarajata/wf-assembly-snps-mod \
  -profile docker \
  --input /path/to/assemblies \
  --outdir results \
  --scalable_mode true

# Recombination-aware analysis
nextflow run PHemarajata/wf-assembly-snps-mod \
  -profile docker \
  --input /path/to/assemblies \
  --outdir results \
  --recombination_aware_mode true

📋 Overview

This workflow identifies single nucleotide polymorphisms (SNPs) from bacterial genome assemblies and constructs phylogenetic trees. It offers three distinct analysis modes:

🔬 Standard Mode

  • Core genome alignment using Parsnp
  • SNP distance matrix calculation
  • Maximum likelihood phylogeny construction
  • Suitable for datasets up to ~200 genomes
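For small inputs, the SNP distance step can be pictured as pairwise Hamming counting over a core alignment. The sketch below is purely illustrative (the workflow uses dedicated tools for this step); the sequence names and bases are made up:

```python
# Illustrative pairwise SNP distance matrix from an aligned set of sequences.
# Not the pipeline's implementation - a conceptual sketch only.

def snp_distance(a: str, b: str) -> int:
    """Count differing positions, ignoring gaps and ambiguous bases."""
    valid = set("ACGT")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in valid and y in valid and x != y
    )

def distance_matrix(seqs: dict) -> dict:
    """Return {(name1, name2): distance} for every unordered pair."""
    names = sorted(seqs)
    return {
        (n1, n2): snp_distance(seqs[n1], seqs[n2])
        for i, n1 in enumerate(names)
        for n2 in names[i + 1:]
    }

# Made-up toy alignment; the gap in C is skipped when counting.
aln = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACG-ACGA"}
dm = distance_matrix(aln)
print(dm[("A", "B")])  # 1
```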

📈 Scalable Mode

  • Divide-and-conquer approach for large datasets (hundreds to thousands of genomes)
  • Pre-clustering using Mash k-mer distances
  • Per-cluster analysis with SKA alignment and IQ-TREE2
  • Global integration with backbone phylogeny
  • Support for UShER incremental updates

🧬 Recombination-Aware Mode

  • Recombination detection using Gubbins
  • Masked SNP analysis excluding recombinant regions
  • Enhanced phylogenetic accuracy for highly recombinogenic species

🎯 Key Features

  • Multiple analysis modes optimized for different dataset sizes and biological questions
  • Flexible input formats: Directory of FASTA files or CSV samplesheet
  • Comprehensive output: SNP matrices, phylogenetic trees, quality reports
  • HPC ready with built-in profiles for various compute environments
  • Containerized with Docker, Singularity, and Conda support
  • Reproducible with detailed provenance tracking

📊 Input Requirements

Accepted File Formats

  • FASTA files with extensions: .fasta, .fas, .fna, .fsa, .fa
  • Optional compression: gzip (.gz)
  • File naming: no spaces; filenames must be unique
  • Minimum size: 45 kb (configurable with --min_input_filesize)
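The rules above can be expressed as a small pre-flight check before launching a run. This helper is hypothetical, not part of the workflow; the extension set and the 45,000-byte floor mirror the documented defaults:

```python
# Hypothetical pre-flight check mirroring the stated input rules:
# accepted FASTA extensions, optional .gz, no spaces, minimum size.
from pathlib import Path

FASTA_EXTS = {".fasta", ".fas", ".fna", ".fsa", ".fa"}
MIN_BYTES = 45_000  # mirrors the 45k default of --min_input_filesize

def check_assembly(path: Path) -> list:
    """Return a list of problems; an empty list means the file looks OK."""
    problems = []
    suffixes = path.suffixes
    # Allow a trailing .gz, then require a recognized FASTA extension.
    exts = suffixes[:-1] if suffixes and suffixes[-1] == ".gz" else suffixes
    if not exts or exts[-1].lower() not in FASTA_EXTS:
        problems.append(f"{path.name}: unexpected extension")
    if " " in path.name:
        problems.append(f"{path.name}: filename contains spaces")
    if path.exists() and path.stat().st_size < MIN_BYTES:
        problems.append(f"{path.name}: smaller than {MIN_BYTES} bytes")
    return problems
```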

Input Methods

Option 1: Directory Input

--input /path/to/assemblies/

Option 2: Samplesheet Input

sample,file
SAMPLE_1,/path/to/SAMPLE_1.fasta
SAMPLE_2,/path/to/SAMPLE_2.fasta
SAMPLE_3,/path/to/SAMPLE_3.fasta
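A samplesheet in the format above can be generated from a directory of assemblies with a short script. This helper is hypothetical (not shipped with the workflow) and derives sample IDs from filename stems:

```python
# Hypothetical helper: write a sample,file samplesheet from a directory
# of assemblies, deriving sample IDs from filename stems.
import csv
from pathlib import Path

def write_samplesheet(assembly_dir: str, out_csv: str) -> int:
    """Write the samplesheet and return the number of samples found."""
    exts = {".fasta", ".fas", ".fna", ".fsa", ".fa"}
    rows = sorted(
        (p.stem, str(p.resolve()))
        for p in Path(assembly_dir).iterdir()
        if p.suffix.lower() in exts
    )
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sample", "file"])
        writer.writerows(rows)
    return len(rows)
```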

βš™οΈ Configuration Modes

Standard Mode (Default)

nextflow run PHemarajata/wf-assembly-snps-mod \
  --input assemblies/ \
  --outdir results

Scalable Mode

For datasets with >200 genomes:

nextflow run PHemarajata/wf-assembly-snps-mod \
  --input assemblies/ \
  --outdir results \
  --scalable_mode true \
  --mash_threshold 0.025 \
  --max_cluster_size 100

Recombination-Aware Mode

For accurate phylogeny in highly recombinogenic species:

nextflow run PHemarajata/wf-assembly-snps-mod \
  --input assemblies/ \
  --outdir results \
  --recombination_aware_mode true \
  --recombination gubbins

πŸ› οΈ Parameters

Core Parameters

Parameter Default Description
--input null Input directory or samplesheet (required)
--outdir null Output directory (required)
--ref null Reference genome (optional)
--snp_package parsnp SNP calling tool
--min_input_filesize 45k Minimum input file size

Workflow Mode Parameters

Parameter Default Description
--scalable_mode false Enable scalable clustering workflow
--recombination_aware_mode true* Enable recombination detection
--workflow_mode cluster Workflow mode: cluster/place/global

Clustering Parameters (Scalable Mode)

Parameter Default Description
--mash_threshold 0.028 Distance threshold for clustering
--max_cluster_size 50 Maximum genomes per cluster
--merge_singletons true Merge singleton clusters
--mash_sketch_size 50000 Mash sketch size for large datasets
--mash_kmer_size 21 K-mer size for Mash
--mash_min_copies 1 Minimum k-mer copies

Recombination Analysis Parameters

Parameter Default Description
--recombination gubbins Recombination detection tool
--run_gubbins true Enable Gubbins analysis
--gubbins_iterations 3 Maximum Gubbins iterations
--gubbins_use_hybrid true Use hybrid tree building
--gubbins_first_tree_builder rapidnj Fast initial tree builder
--gubbins_tree_builder iqtree Refined tree builder
--gubbins_min_snps 2 Minimum SNPs for analysis

Parsnp Parameters

Parameter Default Description
--curated_input false Use curated input mode
--tree_method fasttree Tree method: fasttree/raxml
--max_partition_size 15000 Maximum partition size

IQ-TREE Parameters

Parameter Default Description
--iqtree_model GTR+ASC Evolutionary model
--iqtree_asc_model GTR+ASC Ascertainment bias correction

Integration Parameters

Parameter Default Description
--integrate_results true Integrate cluster results
--alignment_method snippy Alignment method (snippy/parsnp)
--backbone_method parsnp Backbone tree method

UShER Parameters

Parameter Default Description
--build_usher_mat false Build UShER mutation tree
--existing_mat null Existing UShER tree file

Output Parameters

Parameter Default Description
--create_excel_outputs false Create Excel format outputs
--excel_sheet_name Sheet1 Excel sheet name
--publish_dir_mode copy How to publish outputs

Resource Limits

Parameter Default Description
--max_memory 128.GB Maximum memory per process
--max_cpus 16 Maximum CPUs per process
--max_time 240.h Maximum runtime per process

Note: Defaults marked with * may be overridden by specific configuration profiles.

🖥️ Compute Profiles

Pre-configured profiles for different computing environments:

  • docker - Docker containers (default)
  • singularity - Singularity containers
  • conda - Conda environments
  • local_workstation - Local workstation (12 cores, 64GB RAM)
  • dgx_station - DGX Station A100 (128 cores, 512GB RAM)
  • aspen_hpc - Aspen HPC cluster
  • rosalind_hpc - Rosalind HPC cluster

📈 Scalable Mode Details

The scalable mode implements a divide-and-conquer approach in three phases:

1. Pre-clustering Phase

  • Fast k-mer distance estimation with Mash
  • Single-linkage clustering to group similar genomes
  • Automatic cluster size optimization
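The single-linkage step can be sketched as a union-find over pairwise distances: any pair of genomes at or below --mash_threshold ends up in the same cluster. The names and distances below are invented for illustration and are not real Mash output:

```python
# Sketch of single-linkage clustering over pairwise distances (union-find).
# Illustrative only; the distances here are made up, not real Mash output.

def single_linkage(names, distances, threshold):
    """distances: {(a, b): d}. Returns clusters as sorted lists of names."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Union every pair whose distance is within the threshold.
    for (a, b), d in distances.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())

genomes = ["g1", "g2", "g3", "g4"]
dists = {("g1", "g2"): 0.01, ("g2", "g3"): 0.02, ("g1", "g4"): 0.20}
print(single_linkage(genomes, dists, threshold=0.028))
# [['g1', 'g2', 'g3'], ['g4']]
```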

2. Per-cluster Analysis

  • Reference-free SNP alignment with SKA
  • Maximum likelihood phylogeny with IQ-TREE2
  • Optional recombination detection with Gubbins

3. Global Integration

  • Backbone phylogeny construction
  • Cross-cluster SNP distance matrices
  • Optional UShER mutation-annotated trees
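Conceptually, the cross-cluster distance tables amount to per-cluster matrices merged into one global table, with pairs that never shared a cluster left undefined. A toy sketch of that merge, not the pipeline's actual implementation:

```python
# Conceptual sketch: merge per-cluster SNP distance dicts into one global
# table; pairs never co-clustered get None. Illustrative only.

def merge_matrices(cluster_matrices):
    """cluster_matrices: list of {(a, b): dist} dicts, one per cluster."""
    samples = sorted({s for m in cluster_matrices for pair in m for s in pair})
    merged = {}
    for a in samples:
        for b in samples:
            if a < b:
                merged[(a, b)] = None  # unknown unless some cluster covers it
    for m in cluster_matrices:
        for (a, b), d in m.items():
            merged[tuple(sorted((a, b)))] = d
    return merged
```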

πŸ“ Output Structure

results/
├── alignments/                  # Core genome alignments
├── snp_distances/               # SNP distance matrices
├── phylogeny/                   # Phylogenetic trees (Newick format)
├── reports/                     # Quality control reports
├── gubbins/                     # Recombination analysis (if enabled)
├── clusters/                    # Per-cluster results (scalable mode)
│   ├── cluster_1/               # Individual cluster results
│   └── cluster_N/
├── backbone.treefile            # Backbone tree (scalable mode)
├── cluster_representatives.tsv  # Cluster representative mappings
├── final_grafted.treefile       # Complete grafted tree (if successful)
├── grafting_report.txt          # Tree grafting summary
├── grafting_log.txt             # Detailed grafting log
└── pipeline_info/               # Execution reports and logs

🔧 Installation

Prerequisites

  • Nextflow ≥22.04.3
  • Docker, Singularity, or Conda

Quick Installation

# Install Nextflow
curl -s https://get.nextflow.io | bash

# Test the workflow
nextflow run PHemarajata/wf-assembly-snps-mod -profile test,docker

📖 Documentation

Comprehensive documentation is available in the docs/ directory.

🚀 Convenience Wrapper

For easier usage, a wrapper script is provided:

# Make the script executable
chmod +x run_workflow.sh

# Standard analysis
./run_workflow.sh --input assemblies/ --outdir results

# Scalable mode for large datasets
./run_workflow.sh --input assemblies/ --mode scalable --profile local_workstation

# Recombination-aware analysis
./run_workflow.sh --input assemblies/ --mode recombination

# Pass additional parameters
./run_workflow.sh --input assemblies/ --mode scalable -- --mash_threshold 0.025

🌳 Standalone Tree Grafting

When the main workflow encounters issues in the final tree grafting step, you can use the standalone graft_trees.py script to complete the phylogenetic analysis separately.

Background

In scalable mode, the workflow generates:

  1. Backbone tree - Global phylogeny from cluster representatives
  2. Cluster trees - Individual phylogenies for each genome cluster
  3. Final step - Grafting cluster subtrees onto the backbone tree

The tree grafting step can sometimes fail due to memory constraints, label conflicts, or tree structure issues. The standalone script provides a robust solution with detailed logging and error handling.
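The grafting idea itself is easy to illustrate on Newick strings: each cluster's representative tip in the backbone is replaced by that cluster's subtree. The toy function below shows the concept only; graft_trees.py operates on parsed trees and additionally handles label conflicts, branch lengths, and logging:

```python
# Toy illustration of tree grafting: splice each cluster subtree into the
# backbone at its representative tip. Works only when tip labels are
# unambiguous substrings; graft_trees.py does this robustly on parsed trees.

def graft(backbone_newick: str, rep_to_subtree: dict) -> str:
    out = backbone_newick
    for rep, subtree in rep_to_subtree.items():
        # Strip the subtree's trailing semicolon before splicing it in.
        out = out.replace(rep, subtree.rstrip(";\n"))
    return out

backbone = "(repA:0.1,repB:0.2);"
subtrees = {"repA": "(s1:0.01,s2:0.02);"}
print(graft(backbone, subtrees))
# ((s1:0.01,s2:0.02):0.1,repB:0.2);
```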

Prerequisites

# Install required Python package
pip install biopython

Basic Usage

# Make the script executable
chmod +x graft_trees.py

# Basic tree grafting
./graft_trees.py \
  --backbone results/backbone.treefile \
  --clusters 'results/Clusters/**/cluster_*.final.treefile' \
  --reps results/cluster_representatives.tsv \
  --out-tree results/final_grafted.treefile \
  --report results/grafting_report.txt \
  --log results/grafting_log.txt

Advanced Options

# With conflict resolution and detailed logging
./graft_trees.py \
  --backbone results/backbone.treefile \
  --clusters 'results/Clusters/**/cluster_*.final.treefile' \
  --reps results/cluster_representatives.tsv \
  --out-tree results/final_grafted.treefile \
  --report results/grafting_report.txt \
  --log results/grafting_log.txt \
  --rename-conflicts \
  --parent-edge-mode keep

# Dry run to check what would be done
./graft_trees.py \
  --backbone results/backbone.treefile \
  --clusters 'results/Clusters/**/cluster_*.final.treefile' \
  --reps results/cluster_representatives.tsv \
  --out-tree results/final_grafted.treefile \
  --report results/grafting_report.txt \
  --log results/grafting_log.txt \
  --dry-run

Parameters

Parameter Required Description
--backbone ✅ Backbone Newick tree file
--clusters ✅ Glob pattern for cluster tree files (repeatable)
--reps ❌ TSV file with cluster→representative mappings
--out-tree ✅ Output combined tree file
--report ✅ Summary report file
--log ✅ Detailed log file
--rename-conflicts ❌ Rename conflicting tip labels
--parent-edge-mode ❌ Branch length handling: keep or zero
--dry-run ❌ Plan only, don't write the output tree

Expected Input Files

From a scalable workflow run, you'll typically find:

results/
├── backbone.treefile              # Global backbone phylogeny
├── cluster_representatives.tsv    # Cluster→representative mapping
└── Clusters/
    ├── cluster_1/
    │   └── cluster_1.final.treefile
    ├── cluster_2/
    │   └── cluster_2.final.treefile
    └── ...

Troubleshooting

Common Issues:

  1. Missing representative file: If cluster_representatives.tsv is missing, the script will infer representatives automatically
  2. Label conflicts: Use --rename-conflicts to automatically rename conflicting tip labels
  3. Memory issues: The standalone script is more memory-efficient than the Nextflow process
  4. Tree structure problems: Check the detailed log file for specific grafting failures

Check your results:

# Verify the final tree structure
python -c "
from Bio import Phylo
tree = Phylo.read('results/final_grafted.treefile', 'newick')
print(f'Final tree has {len(tree.get_terminals())} tips')
print(f'Max root-to-tip depth: {max(tree.depths().values()):.4f}')
"

# View the grafting report
cat results/grafting_report.txt

Example Usage Script

A complete example script is provided to demonstrate the typical tree grafting workflow:

# Run the example (after completing a scalable workflow)
chmod +x examples/run_tree_grafting_example.sh
./examples/run_tree_grafting_example.sh

🧪 Testing

# Quick test with sample data
nextflow run PHemarajata/wf-assembly-snps-mod -profile test,docker

# Test scalable mode
nextflow run PHemarajata/wf-assembly-snps-mod -profile test,docker --scalable_mode true

# Use the convenient wrapper script
./run_workflow.sh --input test_data/ --mode scalable --profile docker

# Test the tree grafting script (requires Python + Biopython)
pip install biopython
./graft_trees.py --help

🤝 Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

πŸ™ Acknowledgments

📧 Support

For questions or support, please open an issue on the GitHub repository.


Citation: If you use this workflow in your research, please cite the original tools and consider citing this repository.
