This is a Snakemake pipeline to generate a consensus sequence and call variants (SNPs and haplotypes) using viral amplicon short-read Illumina data and a reference genome. Much of the work is performed by the Python script ConsIter.py which iterates mapping of reads to a reference genome, calling variants, and updating the reference genome until there is no improvement in mapping rate. It then masks bases below a specified coverage threshold.
Download this repository with `git clone --recursive https://github.com/niaid/viral-assembly-variant-GATK
There are two pipeline options:
- Reference-mapping approach. Map error-corrected reads to a reference genome and generate a consensus sequence based on majority variants. Best if the data are expected to be quite similar to the reference genome.
- De-novo assembly and scaffolding. Assemble contigs directly from error-corrected reads. Scaffold the contigs (i.e. order and orient) based on alignment to a reference genome. Best if the data are expected to be quite divergent from the reference genome and may have large structural variants.
The ref
directory should contain adapter sequences and the reference genome. If targeted amplicon sequencing was performed, a multifasta file with each targeted region can be provided.
The data directory should have a separate directory for each sample. Multiple fastq files per sample will be concatenated prior to analysis. All fastq files should end with either _R1.fastq or _R2.fastq. Only paired-end (PE) reads are supported.
An example directory structure is below:
data
├── SRR8209080
│ └── fastq
│ ├── SRR8209080_R1.fastq
│ └── SRR8209080_R2.fastq
├── SRR8209081
│ └── fastq
│ ├── SRR8209081_R1.run1.fastq
│ └── SRR8209081_R2.run1.fastq
| ├── SRR8209081_R1.run2.fastq
│ └── SRR8209081_R2.run2.fastq
└── SRR8209083
└── fastq
├── SRR8209083_R1.fastq
└── SRR8209083_R2.fastq
Edit the config.yaml
file to indicate whether you want to run the reference-mapping- or denovo-assembly-based approach, and the names of files with the reference genome, adapter sequences, and (optionally) primer sequences. Those files should be placed in the ref
directory.
Edit the samples.csv
file with the names of the samples in the data
directory, structured as shown above.
Run: snakemake --cores <ncores> --use-conda all
with replaced with your desired number of CPU cores.
Using the default parameters in config.yaml, the pipeline may use up to 50Gb of memory. Depending on the size and quality of your input files, SPAdes error correction and/or assembly may use more.
Running this pipeline requires:
All other required bioinformatic tools will be installed in isolated conda environments automatically using Snakemake when the pipeline is run:
- BBTools
- Bowtie2 v2.4.4
- GATK v4.2.2
- Picard v2.26
- Samtools v1.13
- BedTools v2.30
- RagTag v2.0.1
- Minimap2
- Biopython v1.79
- Spades v3.15.5
- LoFreq v2.1.5
- CliqueSNV v2.0.2
- Bamtools v2.4.0
Output files are in each sample directory under data
.
*.consensus.fa
-- final consensus sequences for all sequences in reference fasta file
*.consensus.sample_name.fa
-- same as above but sample name included in fasta headers
*.bt2.rmdup.bam
-- input reads aligned to consensus sequences using bowtie2, duplicates removed
*.haplotypes.cliquesnv.fasta
-- haplotypes (and frequencies) for all references sequences reconstructed using cliqueSNV
*REF*.fasta
-- haplotypes reconstructed for each reference sequence in the reference fasta file
*.lofreq.vcf
-- within-sample variants (iSNVs) with frequencies determined using LoFreq
- Why does the consensus have Ns?
The ConsIter module masks consensus with Ns if the coverage goes below a certain threshold. In the example config.yaml, this threshold is set to 100x (see comments in the config for other parameters and the ConsIter repo for more info).
By using this software, you agree this software is to be used for research purposes only. Any presentation of data analysis using the software will acknowledge the software according to the guidelines below.
Primary author(s): David B. Stern
Organizational contact information: bioinformatics AT niaid.nih.gov
Date of release: 12/28/2021
Version: 0.1
License details: see LICENSE file
A review of this code has been conducted, no critical errors exist, and to the best of the authors knowledge, there are no problematic file paths, no local system configuration details, and no passwords or keys included in this code. This open source software comes as is with absolutely no warranty.