DKFZ-UNITE-Administration/SNVCallingWorkflow

 
 


DKFZ SNVCalling Workflow

An SNV calling workflow developed in the Applied Bioinformatics and Theoretical Bioinformatics groups at the DKFZ. An earlier (pre-GitHub) version of this workflow was used in the Pancancer project.

Your opinion matters! The development of this workflow is supported by the German Network for Bioinformatic Infrastructure (de.NBI). By completing this very short (30-60 seconds) survey, you support our efforts to improve this tool.

Installation

To run the workflow you first need to install a number of components and dependencies.

  • You need a working Roddy installation. The version depends on the workflow version you want to use. You can find it in the buildinfo.txt under 'RoddyAPIVersion'. Please follow the instructions for the installation of Roddy itself and the PluginBase and the DefaultPlugin components. The main reference here is the Roddy documentation.
  • Install the version of this workflow plugin that you need, either from the release tarballs or with git clone into your plugin directory.

Furthermore, you need a number of tools and reference data, such as a genome assembly and annotation databases.

Tool installation

The workflow contains a description of a Conda environment. A number of Conda packages from BioConda are required.

First add the required Conda channels, including BioConda:

conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels bioconda-legacy

Then create the environment:

conda env create -n SNVCallingWorkflow -f $PATH_TO_PLUGIN_DIRECTORY/resources/analysisTools/snvPipeline/environments/conda.yml

The name of the Conda environment is arbitrary but needs to be consistent with the condaEnvironmentName variable. The default for that variable is set in resources/configurationFiles/analysisSNVCalling.xml.

Note that the Conda environment is not exactly the same as the software stack used for the Pancancer project.

PyPy

PyPy is an alternative Python interpreter. Some of the Python scripts in the workflow can use PyPy to achieve higher performance by employing a fork of hts-python. Currently, this is not implemented for the Conda environment. In most cases you should therefore set the PYPY_OR_PYTHON_BINARY variable to just python, so that the Python binary from the Conda environment is used. You could set up a resources/analysisTools/snvPipeline/environments/conda_snvAnnotation.sh file similar to the tbi-lsf-cluster_snvAnnotation.sh file in the same directory.
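As a minimal sketch, the two Conda-related settings can be combined into one string for Roddy's --cvalues option, which takes comma-separated key:value pairs. The helper join_cvalues is hypothetical, and the sketch assumes that both variables can be overridden on the command line like other configuration values.

```shell
# Hypothetical helper (not part of the workflow): join key:value pairs
# with commas, the separator used by Roddy's --cvalues option.
join_cvalues() {
    local IFS=','
    printf '%s\n' "$*"
}

# Assumption: the Conda environment was created under the name
# SNVCallingWorkflow, as in the "conda env create" command above.
CVALUES="$(join_cvalues \
    "PYPY_OR_PYTHON_BINARY:python" \
    "condaEnvironmentName:SNVCallingWorkflow")"

# Print the option so it can be pasted into a roddy.sh call.
echo "--cvalues=\"$CVALUES\""
```

If the environment was created under a different name, the condaEnvironmentName value has to change accordingly.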

Reference data installation

TBD

Running the workflow

Configuration Values

| Switch | Default | Description |
|---|---|---|
| bamfile_list | empty | Semicolon-separated list of BAM files, starting with the control's BAM. Each BAM file needs an index file with the same name as the BAM plus a ".bai" suffix (e.g. tumor.bam.bai) |
| sample_list | empty | Semicolon-separated list of sample names in the same order as bamfile_list |
| possibleTumorSampleNamePrefixes | "( tumor )" | Bash array of tumor sample name prefixes |
| possibleControlSampleNamePrefixes | "( control )" | Bash array of control sample name prefixes |
| CHROMOSOME_INDICES | empty | Bash array of chromosome names to which the analysis should be restricted |
| CHROMOSOME_LENGTH_FILE | empty | Headerless TSV file with two columns: chromosome name and chromosome size |
| CHR_SUFFIX | "" | Suffix added to the chromosome names |
| CHR_PREFIX | "" | Prefix added to the chromosome names |
| extractSamplesFromOutputFiles | true | Refer to the documentation of the COWorkflowBasePlugin for further information |
| PYPY_OR_PYTHON_BINARY | pypy | The binary to use for some of the Python scripts. For filter_PEoverlap.py, using a PyPy binary here also triggers the use of hts-python instead of pysam |
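The constraints on bamfile_list and sample_list can be checked before submitting a run. The following is a minimal sketch of such a check; check_bam_inputs is a hypothetical helper that is not part of the workflow, and it only verifies that both lists have the same length and that each BAM and its ".bai" index exist.

```shell
# Hypothetical pre-flight check; not part of the workflow itself.
# Usage: check_bam_inputs "control.bam;tumor.bam" "control;tumor"
check_bam_inputs() {
    bam_list=$1
    sample_list=$2

    # Count the semicolon-separated entries of both lists.
    n_bams=$(printf '%s\n' "$bam_list" | awk -F';' '{ print NF }')
    n_samples=$(printf '%s\n' "$sample_list" | awk -F';' '{ print NF }')
    if [ "$n_bams" -ne "$n_samples" ]; then
        echo "bamfile_list has $n_bams entries, sample_list has $n_samples" >&2
        return 1
    fi

    # Each BAM needs an index with the same name plus ".bai".
    status=0
    old_ifs=$IFS
    IFS=';'
    for bam in $bam_list; do
        [ -f "$bam" ] || { echo "missing BAM: $bam" >&2; status=1; }
        [ -f "$bam.bai" ] || { echo "missing index: $bam.bai" >&2; status=1; }
    done
    IFS=$old_ifs
    return "$status"
}
```

The function returns a non-zero status on the first kind of problem it finds, so it can guard a roddy.sh call in a wrapper script. It does not check that the control BAM comes first, because that cannot be derived from the file names alone.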

Example Call

roddy.sh run projectConfigurationName@analysisName patientId \
--useconfig=/path/to/your/applicationProperties.ini --configurationDirectories=/path/to/your/projectConfigs \
--useiodir=/input/directory,/output/directory/snv \
--usePluginVersion=SNVCallingWorkflow:1.3.2 \
--cvalues="bamfile_list:/path/to/your/control.bam;/path/to/your/tumor.bam,sample_list:normal;tumor,possibleTumorSampleNamePrefixes:tumor,possibleControlSampleNamePrefixes:normal,REFERENCE_GENOME:/reference/data/hs37d5_PhiX.fa,CHROMOSOME_LENGTH_FILE:/reference/data/hs37d5_PhiX.chromSizes,extractSamplesFromOutputFiles:false"
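The CHROMOSOME_LENGTH_FILE passed in this call is a headerless TSV with chromosome name and size. One possible way to produce such a file is to take the first two columns of a FASTA index (.fai), which hold the sequence name and length. The sketch below assumes the index already exists (for example, created with samtools faidx); make_chrom_sizes is a hypothetical helper, and whether the workflow expects the file to be limited to particular chromosomes is not covered here.

```shell
# Hypothetical helper: derive a chromosome length file from a FASTA index.
# Assumes <genome>.fa.fai exists, e.g. created with `samtools faidx <genome>.fa`.
make_chrom_sizes() {
    fai=$1
    out=$2
    # Columns 1 and 2 of a .fai file are the sequence name and its length.
    cut -f1,2 "$fai" > "$out"
}

# Example with placeholder paths:
# make_chrom_sizes /reference/data/hs37d5_PhiX.fa.fai \
#     /reference/data/hs37d5_PhiX.chromSizes
```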

No Control

TBD

Cross-Species Contaminations

In coding regions, the proportion of synonymous mutations among all mutations is expected to be low. A high proportion of synonymous mutations therefore suggests cross-species contamination. Any value above 0.5 (i.e. more than 50% of mutations are synonymous) indicates contamination. A value below 0.35 is considered OK. Values in the range 0.35-0.5 are unclear.
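The thresholds can be written down as a small classifier. This is only an illustrative sketch: classify_synonymous_fraction is a hypothetical helper, and the two counts (synonymous and total mutations in coding regions) are assumed to be taken from the workflow's results by other means.

```shell
# Hypothetical helper; the thresholds are the ones stated in the text.
# Usage: classify_synonymous_fraction <synonymous_count> <total_count>
classify_synonymous_fraction() {
    awk -v syn="$1" -v total="$2" 'BEGIN {
        # Without any mutations the fraction is not defined.
        if (total <= 0) { print "undefined"; exit }
        frac = syn / total
        if (frac > 0.5)       print "contamination"
        else if (frac < 0.35) print "ok"
        else                  print "unclear"
    }'
}
```

For example, 40 synonymous out of 100 coding mutations gives a fraction of 0.4, which falls into the unclear range.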

Contributors

Have a look at the Contributors file.

About

The DKFZ-ODCF, formerly DKFZ/eilslabs SNV-Calling Workflow
