A modified and enhanced Nextflow workflow for bacterial genome assembly-based SNP identification and phylogenetic analysis.
Note: This is a modified version of bacterial-genomics/wf-assembly-snps with additional features and optimizations.
```bash
# Typical usage for Burkholderia pseudomallei SNP analysis on APHL Analysis Laptop
nextflow run main.nf -profile local_workstation_rtx4070,docker \
    --input /path/to/assemblies \
    --outdir /path/to/results \
    --recombination_aware_mode true \
    --integrate_results true \
    --mash_threshold 0.028 \
    --max_cluster_size 50 \
    --merge_singletons true \
    --mash_sketch_size 50000 \
    --recombination gubbins \
    --snp_package parsnp \
    --run_gubbins
```
```bash
# Basic usage with input directory
nextflow run PHemarajata/wf-assembly-snps-mod \
    -profile docker \
    --input /path/to/assemblies \
    --outdir results

# Scalable mode for large datasets (>200 genomes)
nextflow run PHemarajata/wf-assembly-snps-mod \
    -profile docker \
    --input /path/to/assemblies \
    --outdir results \
    --scalable_mode true

# Recombination-aware analysis
nextflow run PHemarajata/wf-assembly-snps-mod \
    -profile docker \
    --input /path/to/assemblies \
    --outdir results \
    --recombination_aware_mode true
```

This workflow identifies single nucleotide polymorphisms (SNPs) from bacterial genome assemblies and constructs phylogenetic trees. It offers three distinct analysis modes:
Standard mode (default):
- Core genome alignment using Parsnp
- SNP distance matrix calculation
- Maximum likelihood phylogeny construction
- Suitable for datasets up to ~200 genomes

Scalable mode (`--scalable_mode true`):
- Divide-and-conquer approach for large datasets (hundreds to thousands of genomes)
- Pre-clustering using Mash k-mer distances
- Per-cluster analysis with SKA alignment and IQ-TREE2
- Global integration with backbone phylogeny
- Support for UShER incremental updates

Recombination-aware mode (`--recombination_aware_mode true`):
- Recombination detection using Gubbins
- Masked SNP analysis excluding recombinant regions
- Enhanced phylogenetic accuracy for highly recombinogenic species
- Multiple analysis modes optimized for different dataset sizes and biological questions
- Flexible input formats: Directory of FASTA files or CSV samplesheet
- Comprehensive output: SNP matrices, phylogenetic trees, quality reports
- HPC ready with built-in profiles for various compute environments
- Containerized with Docker, Singularity, and Conda support
- Reproducible with detailed provenance tracking
- FASTA files with extensions: `.fasta`, `.fas`, `.fna`, `.fsa`, `.fa`
- Optional compression: gzip (`.gz`)
- File naming: no spaces, unique filenames required
- Minimum size: 45 kb (configurable with `--min_input_filesize`)
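These requirements can be checked before launching a run. A minimal pre-flight sketch (the `check_assembly` helper and its messages are illustrative, not part of the workflow; the 45 kb floor mirrors the `--min_input_filesize` default):

```python
import os

VALID_EXTS = (".fasta", ".fas", ".fna", ".fsa", ".fa")
MIN_SIZE = 45_000  # bytes; mirrors the 45k default of --min_input_filesize

def check_assembly(path: str, seen_names: set) -> list:
    """Return a list of problems for one input file (empty list = OK)."""
    problems = []
    name = os.path.basename(path)
    # gzip compression is allowed, so strip a trailing .gz before checking
    base = name[:-3] if name.endswith(".gz") else name
    if not base.endswith(VALID_EXTS):
        problems.append(f"unsupported extension: {name}")
    if " " in name:
        problems.append(f"spaces in filename: {name}")
    if name in seen_names:
        problems.append(f"duplicate filename: {name}")
    seen_names.add(name)
    if os.path.exists(path) and os.path.getsize(path) < MIN_SIZE:
        problems.append(f"below minimum size: {name}")
    return problems
```

Running such a check up front avoids a failed pipeline launch halfway through staging.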
Pass a directory of assemblies with `--input /path/to/assemblies`, or provide a CSV samplesheet:

```csv
sample,file
SAMPLE_1,/path/to/SAMPLE_1.fasta
SAMPLE_2,/path/to/SAMPLE_2.fasta
SAMPLE_3,/path/to/SAMPLE_3.fasta
```

```bash
nextflow run PHemarajata/wf-assembly-snps-mod \
    --input assemblies/ \
    --outdir results
```

For datasets with >200 genomes:
```bash
nextflow run PHemarajata/wf-assembly-snps-mod \
    --input assemblies/ \
    --outdir results \
    --scalable_mode true \
    --mash_threshold 0.025 \
    --max_cluster_size 100
```

For accurate phylogeny in highly recombinogenic species:
```bash
nextflow run PHemarajata/wf-assembly-snps-mod \
    --input assemblies/ \
    --outdir results \
    --recombination_aware_mode true \
    --recombination gubbins
```

Core parameters:

| Parameter | Default | Description |
|---|---|---|
| `--input` | `null` | Input directory or samplesheet (required) |
| `--outdir` | `null` | Output directory (required) |
| `--ref` | `null` | Reference genome (optional) |
| `--snp_package` | `parsnp` | SNP calling tool |
| `--min_input_filesize` | `45k` | Minimum input file size |
Mode selection:

| Parameter | Default | Description |
|---|---|---|
| `--scalable_mode` | `false` | Enable scalable clustering workflow |
| `--recombination_aware_mode` | `true`* | Enable recombination detection |
| `--workflow_mode` | `cluster` | Workflow mode: `cluster`/`place`/`global` |
Clustering (scalable mode):

| Parameter | Default | Description |
|---|---|---|
| `--mash_threshold` | `0.028` | Distance threshold for clustering |
| `--max_cluster_size` | `50` | Maximum genomes per cluster |
| `--merge_singletons` | `true` | Merge singleton clusters |
| `--mash_sketch_size` | `50000` | Mash sketch size for large datasets |
| `--mash_kmer_size` | `21` | K-mer size for Mash |
| `--mash_min_copies` | `1` | Minimum k-mer copies |
Recombination detection:

| Parameter | Default | Description |
|---|---|---|
| `--recombination` | `gubbins` | Recombination detection tool |
| `--run_gubbins` | `true` | Enable Gubbins analysis |
| `--gubbins_iterations` | `3` | Maximum Gubbins iterations |
| `--gubbins_use_hybrid` | `true` | Use hybrid tree building |
| `--gubbins_first_tree_builder` | `rapidnj` | Fast initial tree builder |
| `--gubbins_tree_builder` | `iqtree` | Refined tree builder |
| `--gubbins_min_snps` | `2` | Minimum SNPs for analysis |
Tree building:

| Parameter | Default | Description |
|---|---|---|
| `--curated_input` | `false` | Use curated input mode |
| `--tree_method` | `fasttree` | Tree method: `fasttree`/`raxml` |
| `--max_partition_size` | `15000` | Maximum partition size |
IQ-TREE:

| Parameter | Default | Description |
|---|---|---|
| `--iqtree_model` | `GTR+ASC` | Evolutionary model |
| `--iqtree_asc_model` | `GTR+ASC` | Ascertainment bias correction |
Result integration (scalable mode):

| Parameter | Default | Description |
|---|---|---|
| `--integrate_results` | `true` | Integrate cluster results |
| `--alignment_method` | `snippy` | Alignment method (`snippy`/`parsnp`) |
| `--backbone_method` | `parsnp` | Backbone tree method |
UShER:

| Parameter | Default | Description |
|---|---|---|
| `--build_usher_mat` | `false` | Build UShER mutation-annotated tree |
| `--existing_mat` | `null` | Existing UShER tree file |
Outputs:

| Parameter | Default | Description |
|---|---|---|
| `--create_excel_outputs` | `false` | Create Excel format outputs |
| `--excel_sheet_name` | `Sheet1` | Excel sheet name |
| `--publish_dir_mode` | `copy` | How to publish outputs |
Resource limits:

| Parameter | Default | Description |
|---|---|---|
| `--max_memory` | `128.GB` | Maximum memory per process |
| `--max_cpus` | `16` | Maximum CPUs per process |
| `--max_time` | `240.h` | Maximum runtime per process |
Note: Default values marked with * may be overridden in specific configuration profiles
Pre-configured profiles for different computing environments:
- `docker` - Docker containers (default)
- `singularity` - Singularity containers
- `conda` - Conda environments
- `local_workstation` - Local workstation (12 cores, 64GB RAM)
- `dgx_station` - DGX Station A100 (128 cores, 512GB RAM)
- `aspen_hpc` - Aspen HPC cluster
- `rosalind_hpc` - Rosalind HPC cluster
The scalable mode implements a sophisticated divide-and-conquer approach:
1. Clustering:
   - Fast k-mer distance estimation with Mash
   - Single-linkage clustering to group similar genomes
   - Automatic cluster size optimization
2. Per-cluster analysis:
   - Reference-free SNP alignment with SKA
   - Maximum likelihood phylogeny with IQ-TREE2
   - Optional recombination detection with Gubbins
3. Global integration:
   - Backbone phylogeny construction
   - Cross-cluster SNP distance matrices
   - Optional UShER mutation-annotated trees
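The clustering step above amounts to the transitive closure of "distance below threshold" links. A toy single-linkage sketch using a union-find structure (`single_linkage_clusters` is illustrative only; the pipeline derives its distances from Mash output):

```python
def single_linkage_clusters(pairs, genomes, threshold=0.028):
    """Group genomes whose pairwise distance falls below the threshold.

    pairs: iterable of (genome_a, genome_b, distance), e.g. parsed from
    `mash dist` output. Any pair under the threshold links its two genomes
    into the same cluster (single linkage = transitive closure of links).
    """
    parent = {g: g for g in genomes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, d in pairs:
        if d <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return list(clusters.values())
```

With the default `--mash_threshold 0.028`, any chain of pairwise-similar genomes collapses into one cluster; the separate `--max_cluster_size` cap (splitting oversized clusters) is omitted from this sketch.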
```
results/
├── alignments/                   # Core genome alignments
├── snp_distances/                # SNP distance matrices
├── phylogeny/                    # Phylogenetic trees (Newick format)
├── reports/                      # Quality control reports
├── gubbins/                      # Recombination analysis (if enabled)
├── clusters/                     # Per-cluster results (scalable mode)
│   ├── cluster_1/                # Individual cluster results
│   └── cluster_N/
├── backbone.treefile             # Backbone tree (scalable mode)
├── cluster_representatives.tsv   # Cluster representative mappings
├── final_grafted.treefile        # Complete grafted tree (if successful)
├── grafting_report.txt           # Tree grafting summary
├── grafting_log.txt              # Detailed grafting log
└── pipeline_info/                # Execution reports and logs
```
- Nextflow ≥ 22.04.3
- Docker, Singularity, or Conda

```bash
# Install Nextflow
curl -s https://get.nextflow.io | bash

# Test the workflow
nextflow run PHemarajata/wf-assembly-snps-mod -profile test,docker
```

Comprehensive documentation is available in the docs/ directory:
- Usage Guide - Detailed usage instructions and examples
- Output Description - Complete output file descriptions
- Scalable Mode - In-depth scalable mode documentation
- HPC Configuration - HPC cluster setup
For easier usage, a wrapper script is provided:
```bash
# Make the script executable
chmod +x run_workflow.sh

# Standard analysis
./run_workflow.sh --input assemblies/ --outdir results

# Scalable mode for large datasets
./run_workflow.sh --input assemblies/ --mode scalable --profile local_workstation

# Recombination-aware analysis
./run_workflow.sh --input assemblies/ --mode recombination

# Pass additional parameters
./run_workflow.sh --input assemblies/ --mode scalable -- --mash_threshold 0.025
```

When the main workflow encounters issues in the final tree grafting step, you can use the standalone graft_trees.py script to complete the phylogenetic analysis separately.
In scalable mode, the workflow generates:
- Backbone tree - Global phylogeny from cluster representatives
- Cluster trees - Individual phylogenies for each genome cluster
- Final step - Grafting cluster subtrees onto the backbone tree
The tree grafting step can sometimes fail due to memory constraints, label conflicts, or tree structure issues. The standalone script provides a robust solution with detailed logging and error handling.
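Conceptually, grafting replaces each representative tip in the backbone with the full subtree of its cluster, keeping the representative's branch length at the attachment point. A deliberately naive string-level sketch of the idea (toy Newick strings; the actual graft_trees.py parses trees properly and handles label conflicts and branch-length modes):

```python
# Toy backbone: repA and repC are cluster representatives
backbone = "((repA:1.0,other:1.0):1.0,repC:2.0);"

# Hypothetical cluster subtrees keyed by their representative
cluster_subtrees = {
    "repA": "(s1:0.1,s2:0.2)",
    "repC": "(s3:0.05,s4:0.07)",
}

grafted = backbone
for rep, subtree in cluster_subtrees.items():
    # Swap the representative tip for its subtree; matching on 'rep:'
    # keeps the backbone branch length attached to the grafted subtree
    # root (roughly what --parent-edge-mode keep does).
    grafted = grafted.replace(f"{rep}:", f"{subtree}:")

print(grafted)
# → (((s1:0.1,s2:0.2):1.0,other:1.0):1.0,(s3:0.05,s4:0.07):2.0);
```

String replacement breaks as soon as one tip label is a prefix of another, which is exactly why the real script operates on parsed trees with explicit conflict handling.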
```bash
# Install required Python package
pip install biopython

# Make the script executable
chmod +x graft_trees.py
```

```bash
# Basic tree grafting
./graft_trees.py \
    --backbone results/backbone.treefile \
    --clusters 'results/Clusters/**/cluster_*.final.treefile' \
    --reps results/cluster_representatives.tsv \
    --out-tree results/final_grafted.treefile \
    --report results/grafting_report.txt \
    --log results/grafting_log.txt

# With conflict resolution and detailed logging
./graft_trees.py \
    --backbone results/backbone.treefile \
    --clusters 'results/Clusters/**/cluster_*.final.treefile' \
    --reps results/cluster_representatives.tsv \
    --out-tree results/final_grafted.treefile \
    --report results/grafting_report.txt \
    --log results/grafting_log.txt \
    --rename-conflicts \
    --parent-edge-mode keep

# Dry run to check what would be done
./graft_trees.py \
    --backbone results/backbone.treefile \
    --clusters 'results/Clusters/**/cluster_*.final.treefile' \
    --reps results/cluster_representatives.tsv \
    --out-tree results/final_grafted.treefile \
    --report results/grafting_report.txt \
    --log results/grafting_log.txt \
    --dry-run
```

| Parameter | Required | Description |
|---|---|---|
| `--backbone` | Yes | Backbone Newick tree file |
| `--clusters` | Yes | Glob pattern for cluster tree files (repeatable) |
| `--reps` | Yes | TSV file with cluster→representative mappings |
| `--out-tree` | Yes | Output combined tree file |
| `--report` | No | Summary report file |
| `--log` | No | Detailed log file |
| `--rename-conflicts` | No | Rename conflicting tip labels |
| `--parent-edge-mode` | No | Branch length handling: `keep` or `zero` |
| `--dry-run` | No | Plan only, don't write output tree |
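The flag table maps onto a standard argparse layout. A sketch of how these documented options could be declared (a hypothetical re-declaration for illustration, not the script's actual source; `action="append"` is what makes `--clusters` repeatable):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        description="Graft cluster subtrees onto a backbone tree")
    p.add_argument("--backbone", required=True,
                   help="Backbone Newick tree file")
    p.add_argument("--clusters", required=True, action="append",
                   help="Glob pattern for cluster tree files (repeatable)")
    p.add_argument("--reps", required=True,
                   help="TSV of cluster-to-representative mappings")
    p.add_argument("--out-tree", required=True,
                   help="Output combined tree file")
    p.add_argument("--report", help="Summary report file")
    p.add_argument("--log", help="Detailed log file")
    p.add_argument("--rename-conflicts", action="store_true",
                   help="Rename conflicting tip labels")
    p.add_argument("--parent-edge-mode", choices=["keep", "zero"],
                   default="keep",
                   help="Branch length handling at the graft point")
    p.add_argument("--dry-run", action="store_true",
                   help="Plan only, don't write output tree")
    return p
```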
From a scalable workflow run, you'll typically find:
```
results/
├── backbone.treefile             # Global backbone phylogeny
├── cluster_representatives.tsv   # Cluster→representative mapping
└── Clusters/
    ├── cluster_1/
    │   └── cluster_1.final.treefile
    ├── cluster_2/
    │   └── cluster_2.final.treefile
    └── ...
```
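The `--clusters` pattern is single-quoted in the earlier examples so the shell passes it to the script unexpanded; the script can then expand `**` itself. In Python, that expansion needs `recursive=True` (the `find_cluster_trees` helper below is illustrative, not part of the script):

```python
import glob

def find_cluster_trees(results_dir: str) -> list:
    """Expand the documented --clusters pattern under a results directory."""
    # '**' only matches across nested directories when recursive=True;
    # without it, the pattern behaves like a single-level '*'
    pattern = f"{results_dir}/Clusters/**/cluster_*.final.treefile"
    return sorted(glob.glob(pattern, recursive=True))
```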
Common Issues:
- Missing representative file: if `cluster_representatives.tsv` is missing, the script will infer representatives automatically
- Label conflicts: use `--rename-conflicts` to automatically rename conflicting tip labels
- Memory issues: the standalone script is more memory-efficient than the Nextflow process
- Tree structure problems: check the detailed log file for specific grafting failures
Check your results:
```bash
# Verify the final tree structure
python -c "
from Bio import Phylo
tree = Phylo.read('results/final_grafted.treefile', 'newick')
print(f'Final tree has {len(tree.get_terminals())} tips')
print(f'Max root-to-tip depth: {max(tree.depths().values())}')
"

# View the grafting report
cat results/grafting_report.txt
```

A complete example script is provided to demonstrate the typical tree grafting workflow:
```bash
# Run the example (after completing a scalable workflow)
chmod +x examples/run_tree_grafting_example.sh
./examples/run_tree_grafting_example.sh
```

```bash
# Quick test with sample data
nextflow run PHemarajata/wf-assembly-snps-mod -profile test,docker

# Test scalable mode
nextflow run PHemarajata/wf-assembly-snps-mod -profile test,docker --scalable_mode true

# Use the convenient wrapper script
./run_workflow.sh --input test_data/ --mode scalable --profile docker

# Test the tree grafting script (requires Python + Biopython)
pip install biopython
./graft_trees.py --help
```

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Original bacterial-genomics/wf-assembly-snps workflow
- nf-core community for workflow development best practices
- All the amazing bioinformatics tool developers whose software powers this workflow
For questions or support:
- Open an issue
- Check the documentation
- Review the usage examples
Citation: If you use this workflow in your research, please cite the original tools and consider citing this repository.