An SNV calling workflow developed in the Applied Bioinformatics and Theoretical Bioinformatics groups at the DKFZ. An earlier (pre-GitHub) version of this workflow was used in the Pancancer project.
Your opinion matters! The development of this workflow is supported by the German Network for Bioinformatic Infrastructure (de.NBI). By completing this very short (30-60 seconds) survey you support our efforts to improve this tool.
To run the workflow you first need to install a number of components and dependencies.
- You need a working Roddy installation. The required Roddy version depends on the workflow version you want to use and is listed in buildinfo.txt under 'RoddyAPIVersion'. Please follow the instructions for installing Roddy itself as well as the PluginBase and DefaultPlugin components. The main reference here is the Roddy documentation.
- Install the workflow version you need, either from the release tarballs or with git clone into your plugin directory.
Furthermore you need a number of tools and of course reference data, like a genome assembly and annotation databases.
The workflow contains a description of a Conda environment. A number of Conda packages from BioConda are required.
First install the BioConda channels:
```bash
conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels bioconda-legacy
```
Then install the environment:

```bash
conda env create -n SNVCallingWorkflow -f $PATH_TO_PLUGIN_DIRECTORY/resources/analysisTools/snvPipeline/environments/conda.yml
```
The name of the Conda environment is arbitrary but needs to be consistent with the condaEnvironmentName variable. The default for that variable is set in resources/configurationFiles/analysisSNVCalling.xml.
Note that the Conda environment is not exactly the same as the software stack used for the Pancancer project.
PyPy is an alternative Python interpreter. Some of the Python scripts in the workflow can use PyPy to achieve higher performance by employing a fork of hts-python. Currently, this is not implemented for the Conda environment. In most cases you should therefore set the PYPY_OR_PYTHON_BINARY variable to python, so that the Python binary from the Conda environment is used. To use PyPy anyway, you could set up a resources/analysisTools/snvPipeline/environments/conda_snvAnnotation.sh similar to the tbi-lsf-cluster_snvAnnotation.sh file in the same directory.
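
As a minimal sketch, both settings could be passed on the command line through `--cvalues`, which takes comma-separated `name:value` pairs as in the example call of this README. The helper name `join_cvalues` is hypothetical and not part of Roddy or the workflow, and the value `SNVCallingWorkflow` assumes the environment was created under that name.

```shell
# Hypothetical helper (not part of Roddy): join name:value pairs with commas.
# The subshell body keeps the IFS change local to the function.
join_cvalues() (
    IFS=,
    printf '%s\n' "$*"
)

# Assumption: the Conda environment was created as "SNVCallingWorkflow".
CVALUES="$(join_cvalues \
    "PYPY_OR_PYTHON_BINARY:python" \
    "condaEnvironmentName:SNVCallingWorkflow")"

# Print the argument that would be appended to the roddy.sh call.
echo "--cvalues=$CVALUES"
```

Further pairs, such as bamfile_list or sample_list, would be appended as additional arguments to the helper.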
TBD
| Switch | Default | Description |
|---|---|---|
| bamfile_list | empty | Semicolon-separated list of BAM files, starting with the control's BAM. Each BAM file needs an index file with the same name as the BAM plus a ".bai" suffix (e.g. tumor.bam.bai) |
| sample_list | empty | Semicolon-separated list of sample names in the same order as bamfile_list |
| possibleTumorSampleNamePrefixes | "( tumor )" | Bash-array of tumor sample name prefixes |
| possibleControlSampleNamePrefixes | "( control )" | Bash-array of control sample name prefixes |
| CHROMOSOME_INDICES | empty | Bash-array of chromosome names to which the analysis should be restricted |
| CHROMOSOME_LENGTH_FILE | empty | Headerless TSV file with two columns: chromosome name and chromosome size |
| CHR_SUFFIX | "" | Suffix added to the chromosome names |
| CHR_PREFIX | "" | Prefix added to the chromosome names |
| extractSamplesFromOutputFiles | true | Refer to the documentation of the COWorkflowBasePlugin for further information |
| PYPY_OR_PYTHON_BINARY | pypy | The binary to use for some of the Python scripts. For filter_PEoverlap.py, using a PyPy binary here also triggers the use of hts-python instead of pysam. |
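
The constraints on bamfile_list and sample_list from the table above can be checked before submitting a run. The following is only a sketch: the function name `check_bam_inputs` is hypothetical and not part of the workflow, and it checks nothing beyond what the table states (every BAM exists, has a ".bai"-suffixed index, and the two lists have the same number of entries).

```shell
# Hypothetical pre-flight check (not part of the workflow).
# $1: semicolon-separated BAM list, $2: semicolon-separated sample list.
# The subshell body keeps the IFS and globbing changes local.
check_bam_inputs() (
    bamfile_list=$1
    sample_list=$2
    IFS=';'
    set -f                       # no filename globbing while splitting
    set -- $bamfile_list         # split the BAM list on ';'
    n_bams=$#
    status=0
    for bam in "$@"; do
        [ -f "$bam" ] || { echo "missing BAM file: $bam" >&2; status=1; }
        [ -f "$bam.bai" ] || { echo "missing BAM index: $bam.bai" >&2; status=1; }
    done
    set -- $sample_list          # split the sample list on ';'
    if [ "$n_bams" -ne "$#" ]; then
        echo "bamfile_list has $n_bams entries, sample_list has $#" >&2
        status=1
    fi
    exit "$status"
)

# Usage (paths are placeholders):
#   check_bam_inputs "/path/to/your/control.bam;/path/to/your/tumor.bam" "normal;tumor"
```

The function returns a non-zero status and prints one message per problem, so it can guard a roddy.sh call with `&&`.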

```bash
roddy.sh run projectConfigurationName@analysisName patientId \
--useconfig=/path/to/your/applicationProperties.ini --configurationDirectories=/path/to/your/projectConfigs \
--useiodir=/input/directory,/output/directory/snv \
--usePluginVersion=SNVCallingWorkflow:1.3.2 \
--cvalues="bamfile_list:/path/to/your/control.bam;/path/to/your/tumor.bam,sample_list:normal;tumor,possibleTumorSampleNamePrefixes:tumor,possibleControlSampleNamePrefixes:normal,REFERENCE_GENOME:/reference/data/hs37d5_PhiX.fa,CHROMOSOME_LENGTH_FILE:/reference/data/hs37d5_PhiX.chromSizes,extractSamplesFromOutputFiles:false"
```

TBD
In coding regions, synonymous mutations are expected to make up a low proportion of all mutations. A high proportion of synonymous mutations suggests cross-species contamination. A value above 0.5 (i.e. at least 50% of mutations are synonymous) indicates contamination. A value below 0.35 is considered OK. Values in the range 0.35-0.5 are unclear.
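
The thresholds above can be written as a small classifier. This is a sketch only: the function name `classify_synonymous_proportion` and the output labels are hypothetical, and a value of exactly 0.5 is assumed to fall into the unclear range.

```shell
# Hypothetical helper: classify the proportion of synonymous mutations.
# $1: proportion between 0 and 1 (synonymous / all mutations in coding regions).
classify_synonymous_proportion() {
    awk -v p="$1" 'BEGIN {
        if (p > 0.5)       print "contamination"   # above 0.5
        else if (p < 0.35) print "ok"              # below 0.35
        else               print "unclear"         # 0.35 to 0.5 (assumed inclusive)
    }'
}

classify_synonymous_proportion 0.42
```

The final line prints the label for a proportion of 0.42, which lies in the unclear range.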
Have a look at the Contributors file.
