[TOC]
The standard MUGQIC DNA-Seq pipeline uses BWA to align reads to the reference genome. Treatment and filtering of mapped reads approaches as INDEL realignment, mark duplicate reads, recalibration and sort are executed using Picard and GATK. Samtools MPILEUP and bcftools are used to produce the standard SNP and indels variants file (VCF). Additional SVN annotations mostly applicable to human samples include mappability flags, dbSNP annotation and extra information about SVN by using published databases. The SNPeff tool is used to annotate variants using an integrated database of functional predictions from multiple algorithms (SIFT, Polyphen2, LRT and MutationTaster, PhyloP and GERP++, etc.) and to calculate the effects they produce on known genes. A list of effects and annotations that SnpEff calculate can be found here.
A summary html report is automatically generated by the pipeline. This report contains description of the sequencing experiment as well as a detailed presentation of the pipeline steps and results. Various Quality Control (QC) summary statistics are included in the report and additional QC analysis is accessible for download directly through the report. The report includes also the main references of the software and methods used during the analysis, together with the full list of parameters that have been passed to the pipeline main script.
An example of the DNA-Seq report for an analysis on public data is available for illustration purpose only: DNA-Seq report.
Here is more information about DNA-Seq pipeline that you may find interesting.
#!text
usage: dnaseq.py [-h] [--help] [-c CONFIG [CONFIG ...]] [-s STEPS]
[-o OUTPUT_DIR] [-j {pbs,batch,daemon}] [-f] [--json]
[--report] [--clean] [-l {debug,info,warning,error,critical}]
[-t {mugqic,mpileup}] [-r READSETS] [-v]
Version: 3.0.1-beta
For more documentation, visit our website: https://bitbucket.org/mugqic/mugqic_pipelines/
optional arguments:
-h show this help message and exit
--help show detailed description of pipeline and steps
-c CONFIG [CONFIG ...], --config CONFIG [CONFIG ...]
config INI-style list of files; config parameters are
overwritten based on files order
-s STEPS, --steps STEPS
step range e.g. '1-5', '3,6,7', '2,4-8'
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
output directory (default: current)
-j {pbs,batch,daemon}, --job-scheduler {pbs,batch,daemon}
job scheduler type (default: pbs)
-f, --force force creation of jobs even if up to date (default:
false)
--json create a JSON file per analysed sample to track the
analysis status (default: false)
--report create 'pandoc' command to merge all job markdown
report files in the given step range into HTML, if
they exist; if --report is set, --job-scheduler,
--force, --clean options and job up-to-date status are
ignored (default: false)
--clean create 'rm' commands for all job removable files in
the given step range, if they exist; if --clean is
set, --job-scheduler, --force options and job up-to-
date status are ignored (default: false)
-l {debug,info,warning,error,critical}, --log {debug,info,warning,error,critical}
log level (default: info)
-t {mugqic,mpileup}, --type {mugqic,mpileup}
DNAseq analysis type
-r READSETS, --readsets READSETS
readset file
-v, --version show the version information and exit
Steps:
------
----
mugqic:
1- picard_sam_to_fastq
2- trimmomatic
3- merge_trimmomatic_stats
4- bwa_mem_picard_sort_sam
5- picard_merge_sam_files
6- gatk_indel_realigner
7- merge_realigned
8- fix_mate_by_coordinate
9- picard_mark_duplicates
10- recalibration
11- verify_bam_id
12- metrics
13- picard_calculate_hs_metrics
14- gatk_callable_loci
15- extract_common_snp_freq
16- baf_plot
17- gatk_haplotype_caller
18- merge_and_call_individual_gvcf
19- combine_gvcf
20- merge_and_call_combined_gvcf
21- variant_recalibrator
22- dna_sample_metrics
23- haplotype_caller_filter_nstretches
24- haplotype_caller_flag_mappability
25- haplotype_caller_snp_id_annotation
26- haplotype_caller_snp_effect
27- haplotype_caller_dbnsfp_annotation
28- haplotype_caller_metrics_vcf_stats
29- haplotype_caller_metrics_snv_graph_metrics
----
mpileup:
1- picard_sam_to_fastq
2- trimmomatic
3- merge_trimmomatic_stats
4- bwa_mem_picard_sort_sam
5- picard_merge_sam_files
6- gatk_indel_realigner
7- merge_realigned
8- fix_mate_by_coordinate
9- picard_mark_duplicates
10- recalibration
11- metrics
12- picard_calculate_hs_metrics
13- gatk_callable_loci
14- extract_common_snp_freq
15- baf_plot
16- gatk_haplotype_caller
17- merge_and_call_individual_gvcf
18- combine_gvcf
19- merge_and_call_combined_gvcf
20- variant_recalibrator
21- dna_sample_metrics
22- rawmpileup
23- rawmpileup_cat
24- snp_and_indel_bcf
25- merge_filter_bcf
26- mpileup_filter_nstretches
27- mpileup_flag_mappability
28- mpileup_snp_id_annotation
29- mpileup_snp_effect
30- mpileup_dbnsfp_annotation
31- mpileup_metrics_vcf_stats
32- mpileup_metrics_snv_graph_metrics
33- verify_bam_id
Convert SAM/BAM files from the input readset file into FASTQ format if FASTQ files are not already specified in the readset file. Do nothing otherwise.
Raw reads quality trimming and removing of Illumina adapters is performed using Trimmomatic. If an adapter FASTA file is specified in the config file (section 'trimmomatic', param 'adapter_fasta'), it is used first. Else, 'Adapter1' and 'Adapter2' columns from the readset file are used to create an adapter FASTA file, given then to Trimmomatic. For PAIRED_END readsets, readset adapters are reversed-complemented and swapped, to match Trimmomatic Palindrome strategy. For SINGLE_END readsets, only Adapter1 is used and left unchanged.
This step takes as input files:
- FASTQ files from the readset file if available
- Else, FASTQ output files from previous picard_sam_to_fastq conversion of BAM files
The trim statistics per readset are merged at this step.
The filtered reads are aligned to a reference genome. The alignment is done per sequencing readset. The alignment software used is BWA with algorithm: bwa mem. BWA output BAM files are then sorted by coordinate using Picard.
This step takes as input files:
- Trimmed FASTQ files if available
- Else, FASTQ files from the readset file if available
- Else, FASTQ output files from previous picard_sam_to_fastq conversion of BAM files
BAM readset files are merged into one file per sample. Merge is done using Picard.
This step takes as input files:
- Aligned and sorted BAM output files from previous bwa_mem_picard_sort_sam step if available
- Else, BAM files from the readset file
Insertion and deletion realignment is performed on regions where multiple base mismatches
are preferred over indels by the aligner since it can appear to be less costly by the algorithm.
Such regions will introduce false positive variant calls which may be filtered out by realigning
those regions properly. Realignment is done using GATK.
The reference genome is divided by a number regions given by the nb_jobs parameter.
BAM files of regions of realigned reads are merged per sample using Picard.
Fix the read mates. Once local regions are realigned, the read mate coordinates of the aligned reads need to be recalculated since the reads are realigned at positions that differ from their original alignment. Fixing the read mate positions is done using BVATools.
Mark duplicates. Aligned reads per sample are duplicates if they have the same 5' alignment positions (for both mates in the case of paired-end reads). All but the best pair (based on alignment score) will be marked as a duplicate in the BAM file. Marking duplicates is done using Picard.
Recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so, provides not only more accurate quality scores but also more widely dispersed ones.
verifyBamID is a software that verifies whether the reads in particular file match previously known genotypes for an individual (or group of individuals), and checks whether the reads are contaminated as a mixture of two samples. verifyBamID can detect sample contamination and swaps when external genotypes are available. When external genotypes are not available, verifyBamID still robustly detects sample swaps.
Compute metrics and generate coverage tracks per sample. Multiple metrics are computed at this stage: Number of raw reads, Number of filtered reads, Number of aligned reads, Number of duplicate reads, Median, mean and standard deviation of insert sizes of reads after alignment, percentage of bases covered at X reads (%_bases_above_50 means the % of exons bases which have at least 50 reads) whole genome or targeted percentage of bases covered at X reads (%_bases_above_50 means the % of exons bases which have at least 50 reads). A TDF (.tdf) coverage track is also generated at this step for easy visualization of coverage in the IGV browser.
Compute on target percent of hybridisation based capture.
Computes the callable region or the genome as a bed track.
Extracts allele frequencies of possible variants accross the genome.
Plots DepthRatio and B allele frequency of previously extracted alleles.
GATK haplotype caller for snps and small indels.
Merges the gvcfs of haplotype caller and also generates a per sample vcf containing genotypes.
Combine the per sample gvcfs of haplotype caller into one main file for all sample.
Merges the combined gvcfs and also generates a general vcf containing genotypes.
GATK VariantRecalibrator. The purpose of the variant recalibrator is to assign a well-calibrated probability to each variant call in a call set. You can then create highly accurate call sets by filtering based on this single estimate for the accuracy of each call. The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, MQ, HaplotypeScore, and ReadPosRankSum, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input, typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array. This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model. Using the tranche file generated by the previous step the ApplyRecalibration walker looks at each variant's VQSLOD value and decides which tranche it falls in. Variants in tranches that fall below the specified truth sensitivity filter level have their filter field annotated with its tranche level. This will result in a call set that simultaneously is filtered to the desired level but also has the information necessary to pull out more variants for a higher sensitivity but a slightly lower quality level.
Merge metrics. Read metrics per sample are merged at this step.
See general filter_nstretches description ! Applied to haplotype caller vcf
See general flag_mappability ! Applied to haplotype caller vcf
See general snp_id_annotation ! Applied to haplotype caller vcf
See general snp_effect ! Applied to haplotype caller vcf
See general dbnsfp_annotation ! Applied to haplotype caller vcf
See general metrics_vcf_stats ! Applied to haplotype caller vcf
See general metrics_vcf_stats ! Applied to haplotype caller vcf