fusionTools

Python scripts that process fusion breakpoints and visualize them with D3.js

The objectives of fusionTools are:

Determine the fusion cDNA and protein sequences
Determine the fusion type (in-frame, out-of-frame or right gene intact)
Tier the importance of the fusion events
Visualize the results in html format
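
The fusion type check can be illustrated with a small sketch. This is not the fusionTools implementation: the function name, its two arguments, and the reading of "right gene intact" (no left coding sequence retained and the whole right coding sequence kept) are assumptions made for illustration. The idea is that the reading frame is preserved when the number of coding bases kept from the left gene and the number of coding bases lost from the right gene leave the same remainder modulo 3.

```python
def classify_fusion(left_cds_bases, right_cds_skipped):
    """Classify a fusion from two hypothetical counts.

    left_cds_bases:    coding bases of the left transcript kept before the breakpoint
    right_cds_skipped: coding bases of the right transcript lost before the breakpoint
    """
    # Assumed meaning of "right gene intact": the left gene contributes no
    # coding sequence and the right coding sequence is kept in full.
    if left_cds_bases == 0 and right_cds_skipped == 0:
        return "right gene intact"
    # The right gene keeps its original codon phase only when both counts
    # have the same remainder modulo 3.
    if left_cds_bases % 3 == right_cds_skipped % 3:
        return "in-frame"
    return "out-of-frame"


if __name__ == "__main__":
    for left, right in [(300, 150), (301, 150), (0, 0)]:
        print(left, right, classify_fusion(left, right))
```

In practice the two counts would have to be derived from the transcript annotation (GTF) and the breakpoint positions, which this sketch leaves out.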

Installation

1. Download Pfam domain database

Please download the Pfam domain database from http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz and then run:

module load hmmer
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm

If you do not have hmmer on your system, please go to http://hmmer.org/ to download and install it.
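
A quick way to confirm that hmmpress finished is to check for the four binary index files it writes next to the HMM file (.h3f, .h3i, .h3m and .h3p). The helper below is a minimal sketch and is not part of fusionTools; the function name is hypothetical.

```python
from pathlib import Path

# Binary index files that hmmpress writes next to the HMM file.
HMMPRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_hmmpress_files(hmm_path):
    """Return the names of hmmpress index files that are absent for hmm_path."""
    hmm = Path(hmm_path)
    return [
        hmm.name + suffix
        for suffix in HMMPRESS_SUFFIXES
        if not hmm.with_name(hmm.name + suffix).exists()
    ]


if __name__ == "__main__":
    missing = missing_hmmpress_files("Pfam-A.hmm")
    print("missing index files:", ", ".join(missing) if missing else "none")
```

An empty list means all four index files are present in the same folder as Pfam-A.hmm.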

2. Download genome FASTA file

You can download the genome FASTA file from: https://www.gencodegenes.org/human/release_36lift37.html or https://hgdownload.soe.ucsc.edu/downloads.html

2.1 Unzip file

Please unzip/gunzip the file after it is downloaded.

gunzip hg19.fa.gz

2.2 Index file

samtools faidx hg19.fa
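
samtools faidx writes a .fai index next to the FASTA file. Each line of that index has five tab-separated columns: sequence name, length, byte offset, bases per line and bytes per line. As a sanity check, the sketch below reads the index and returns the sequence lengths, so you can confirm that the chromosome names (for example chr1 rather than 1) match those in your fusion file. The function name is hypothetical and not part of fusionTools.

```python
import csv


def read_fai(fai_path):
    """Return {sequence name: length} from a samtools .fai index file."""
    lengths = {}
    with open(fai_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row:  # skip blank lines
                lengths[row[0]] = int(row[1])
    return lengths
```

For example, `"chr13" in read_fai("hg19.fa.fai")` should be true before running fusions that involve chr13.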

3.1 Pull Docker/Singularity image

The easiest way to run fusionTools is to use the Docker image.

3.1.1 Use Docker

docker pull hsienchao/fusion_tools:v1

3.1.2 Use Singularity

module load singularity
export SINGULARITY_CACHEDIR=/data/somewhere
singularity pull docker://hsienchao/fusion_tools:v1

3.2 Run fusionTools without Docker/Singularity image

If you want to install the required packages and software yourself, please follow these instructions:

3.2.1 Pull the code from Github

git clone https://github.com/CCRGeneticsBranch/fusionTools.git

3.2.2 Python packages

Install Python 3.7+ on your system.
Install the required packages:

pip install --upgrade gtfparse pyfaidx dataclasses pysam pyyaml Bio numpy pandas pybedtools

3.2.3 Hmmer

Hmmer is a tool for predicting protein domains. Please download and install it by following the instructions at http://hmmer.org/

3.2.4 Add PfamScan Perl global variable

export PERL5LIB=/[your installation path]/PfamScan:${PERL5LIB}

Run fusionTools

Run fusionTools.py

Process the single fusion file

usage: fusionTools.py 

Required:
                      [--input, -i Fusion file]
                      [--output output prefix ]
                      [--fasta, -f Genome FASTA file]
                      [--pfam_file, -p Pfam domain file]
Optional:
                      [--isoform_expression_file, -m Isoform expression file in RSEM format]
                      [--gtf GTF file]
                      [--canonical_trans_file Canonical transcript list]
                      [--fusion_cancer_gene_list Fusion cancer gene pair list]
                      [--cancer_gene_list Cancer gene list]
                      [--domain_file Pfam domain file]
                      [--threads Number of threads]

gtf: GTF file
fasta: Genome FASTA file
isoform_expression_file: RSEM format isoform expression file
canonical_trans_file: Ensembl canonical transcript file
fusion_cancer_gene_list: two column fusion gene pair list (default: Sanger Mitelman list)
cancer_gene_list: one column cancer gene symbol list
pfam_file: Pfam domain file
input: input fusion list
output: output prefix
threads: number of threads

Example

module load python
module load hmmer
export PERL5LIB=/your_pfam_scan_path:$PERL5LIB
python fusionTools.py -g hg19.refseq.gtf -f genome.fa -i fusion_list.txt -o processed_fusion_list.txt -t 16
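
If you launch fusionTools.py from another Python script, the command line can be assembled as a list and passed to subprocess. The sketch below is only an illustration: the helper name and its parameter names are hypothetical, and the short flags -o, -g and -t are taken from the examples in this README rather than from the usage text, so please check them against `fusionTools.py --help`.

```python
import shlex


def build_fusiontools_command(fusion_file, output_prefix, fasta, pfam,
                              gtf=None, isoform_file=None, threads=None):
    """Assemble the fusionTools.py command line as an argument list."""
    # The four required arguments listed in the usage text.
    cmd = ["python", "fusionTools.py",
           "-i", fusion_file,
           "-o", output_prefix,
           "-f", fasta,
           "-p", pfam]
    # Optional arguments are appended only when a value is given.
    if gtf:
        cmd += ["-g", gtf]
    if isoform_file:
        cmd += ["-m", isoform_file]
    if threads:
        cmd += ["-t", str(threads)]
    return cmd


if __name__ == "__main__":
    cmd = build_fusiontools_command(
        "fusion_list.txt", "processed_fusion_list.txt",
        "genome.fa", "Pfam-A.hmm", gtf="hg19.refseq.gtf", threads=16)
    # Print the command for review; run it with subprocess.run(cmd, check=True).
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.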

Process the case

We also developed a script that can process khanlab pipeline cases:

usage: processFusionCase.sh

required:
-d: processed data path
-p: patient ID
-c: case ID
-f: Pfam DB folder
-g: Genome fasta

optional:
-t: number of threads. (default: SLURM_CPUS_PER_TASK variable)
-o: output folder (default: same as input folder)
-v: Gencode version (36, 37, or hg38v39. default: 37 (hg19))

Example:

hg19 (use -v 36 or 37)

./processFusionCase.sh -d /data/Compass/Analysis/ProcessedResults_NexSeq/ExomeRNA_Results \
                       -p CP02796 \
                       -c RT-0391 \
                       -f /data/Clinomics/Ref/khanlab/PfamDB \
                       -g /data/Clinomics/Ref/khanlab/ucsc.hg19.fasta \
                       -v 36

hg38 (use -v hg38v39)

./processFusionCase.sh -d /data/Compass/Analysis/ProcessedResults_NexSeq/ExomeRNA_Results \
                       -p CP02796 \
                       -c RT-0391 \
                       -f /data/Clinomics/Ref/khanlab/PfamDB \
                       -g /data/Clinomics/Ref/khanlab/Index/BWAIndex/hg38.fa \
                       -v hg38v39

Run with Docker

Example:

sudo docker run -v /data:/data hsienchao/fusion_tools:v1 fusionTools.py \
	-i /data/processed_DATA/CP02796/RT-0391/Actionable/CP02796.fusion.actionable.txt \
	-o /data/processed_DATA/CP02796/RT-0391/CP02796/db/CP02796.fusion \
	-m /data/processed_DATA/CP02796/RT-0391/CP02796_T2R_T2/RSEM/CP02796_T2R_T2.rsem.isoforms.results \
	-p /data/ref/PfamDB \
	-f /data/ref/hg19.fasta \
	-t 4

Run with Singularity

Usage:

usage: processFusionCase.sh

required:
-d: processed data path
-p: patient ID
-c: case ID
-f: Pfam DB folder
-g: Genome fasta

optional:
-t: number of threads. (default: SLURM_CPUS_PER_TASK variable)
-o: output folder (default: same as input folder)
-v: Gencode version (default: 37)

Example 1 (hg19 Gencode v37lift37):

singularity exec -e --bind /data/khanlab/projects/processed_DATA,/data/Clinomics/Ref/khanlab/ fusion_tools_v1.sif processFusionCase.sh \
                       -d /data/khanlab/projects/processed_DATA \
                       -p RH4 \
                       -c Khanlab \
                       -f /data/Clinomics/Ref/khanlab/PfamDB \
                       -g /data/Clinomics/Ref/khanlab/ucsc.hg19.fasta

The output file will be: /data/khanlab/projects/processed_DATA/RH4/Khanlab/RH4/db/RH4.fusion.txt

Example 2 (hg19 Gencode v36lift37):

singularity exec -e --bind /data/Compass/Analysis/ProcessedResults_NexSeq/ExomeRNA_Results,/data/Clinomics/Ref/khanlab/ fusion_tools_v1.sif processFusionCase.sh \
                       -d /data/Compass/Analysis/ProcessedResults_NexSeq/ExomeRNA_Results \
                       -p CP02796 \
                       -c RT-0391 \
                       -f /data/Clinomics/Ref/khanlab/PfamDB \
                       -g /data/Clinomics/Ref/khanlab/ucsc.hg19.fasta \
                       -v 36

Example 3 (hg38 Gencode v39):

singularity exec -e --bind /data/Compass/Analysis/ProcessedResults_NexSeq/ExomeRNA_Results,/data/Clinomics/Ref/khanlab/ fusion_tools_v1.sif processFusionCase.sh \
                       -d /data/Compass/Analysis/ProcessedResults_NexSeq/ExomeRNA_Results \
                       -p CP02796 \
                       -c RT-0391 \
                       -f /data/Clinomics/Ref/khanlab/PfamDB \
                       -g /data/Clinomics/Ref/khanlab/Index/BWAIndex/hg38.fa \
                       -v hg38v39

The output file will be: /data/Compass/Analysis/ProcessedResults_NexSeq/ExomeRNA_Results/CP02796/RT-0391/CP02796/db/CP02796.fusion.txt
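
Judging from the two output paths shown above, the result appears to be written to [data path]/[patient ID]/[case ID]/[patient ID]/db/[patient ID].fusion.txt when -o is not given. The sketch below encodes that pattern so downstream scripts can locate the file. The pattern is inferred from the examples only, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def expected_fusion_output(data_path, patient_id, case_id):
    """Build the output path pattern inferred from the README examples."""
    path = (PurePosixPath(data_path) / patient_id / case_id
            / patient_id / "db" / f"{patient_id}.fusion.txt")
    return str(path)


if __name__ == "__main__":
    print(expected_fusion_output(
        "/data/khanlab/projects/processed_DATA", "RH4", "Khanlab"))
```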

Input data

Input example

LeftGene	RightGene	Chr_Left	Position	Chr_Right	Position	Sample	Tool	SpanReadCount
PAX7	FOXO1	chr1	19029790	chr13	41134997	RMSXXX	FusionCatcher	17
PAX7	FOXO1	chr1	19029790	chr13	41134997	RMSXXX	STAR-fusion	20
PAX7	FOXO1	chr1	19029790	chr13	41134997	RMSXXX	tophatFusion	24
AMD1	FARS2	chr6	111196418	chr6	5545413	RMSXXX	STAR-fusion	2
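
The input lists one row per fusion call and tool, so the same breakpoint can appear several times. The sketch below groups the rows by fusion and collects the read count reported by each tool. It is a minimal illustration with a hypothetical function name, not the fusionTools parser. Because the header contains two columns named Position, the columns are read by position rather than by name.

```python
import csv
import io
from collections import defaultdict


def group_fusion_calls(handle):
    """Group tab-separated fusion calls by breakpoint: {key: {tool: reads}}."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    grouped = defaultdict(dict)
    for row in reader:
        if not row:
            continue
        left, right, chr_l, pos_l, chr_r, pos_r, sample, tool, reads = row[:9]
        key = (left, right, chr_l, int(pos_l), chr_r, int(pos_r), sample)
        grouped[key][tool] = int(reads)
    return dict(grouped)


if __name__ == "__main__":
    header = ["LeftGene", "RightGene", "Chr_Left", "Position", "Chr_Right",
              "Position", "Sample", "Tool", "SpanReadCount"]
    rows = [
        ["PAX7", "FOXO1", "chr1", "19029790", "chr13", "41134997", "RMSXXX", "FusionCatcher", "17"],
        ["PAX7", "FOXO1", "chr1", "19029790", "chr13", "41134997", "RMSXXX", "STAR-fusion", "20"],
        ["AMD1", "FARS2", "chr6", "111196418", "chr6", "5545413", "RMSXXX", "STAR-fusion", "2"],
    ]
    text = "\n".join("\t".join(r) for r in [header] + rows) + "\n"
    for key, tools in group_fusion_calls(io.StringIO(text)).items():
        print(key[0], key[1], tools)
```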

Output data

We have two output files:

Text file

Example:

left_gene	right_gene	left_chr	right_chr	left_position	right_position	sample_id	tools	type	tier	left_region	right_region	left_trans	right_trans	left_fusion_cancer_gene	right_fusion_cancer_gene	left_cancer_gene	right_cancer_gene	fusion_proteins	left_trans_info	right_trans_info
PAX7	FOXO1	chr1	19029790	chr13	41134997	RMS2074_D1C5FACXX	[{"FusionCatcher": 17}, {"STAR-fusion": 20}, {"tophatFusion": 24}]	in-frame	1.1	CDS:exon4	CDS:exon2	NM_001135254	NM_002015	Y	Y	Y	Y	{"MAALPGT...VSG*": {"domains": ...}}	...	...
AMD1	FARS2	chr6	111196418	chr6	5545413	RMS2074_D1C5FACXX	[{"STAR-fusion": 2}]	out-of-frame	4.3	CDS	CDS	NM_001634	NM_006567	N	N	N	N	{"MEAAHFF...}	...	...
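
The text output can be post-processed with the Python standard library. The sketch below assumes what the example above suggests: the file is tab-separated, the tools column holds a JSON list, and tier is a number where smaller means more important. The function names are hypothetical, and the tier ordering is an assumption.

```python
import csv
import json


def load_fusions(handle):
    """Read the tab-separated output and decode the JSON in the tools column."""
    rows = []
    # QUOTE_NONE keeps the double quotes inside the JSON columns untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["tools"] = json.loads(row["tools"])
        row["tier"] = float(row["tier"])
        rows.append(row)
    return rows


def in_frame_by_tier(rows):
    """Return the in-frame fusions, lowest tier value first."""
    in_frame = [row for row in rows if row["type"] == "in-frame"]
    return sorted(in_frame, key=lambda row: row["tier"])
```

A typical call would be `in_frame_by_tier(load_fusions(open("RH4.fusion.txt")))`, where the file name follows the Singularity example in this README.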

HTML file

You can open this file on your local computer. By default, all in-frame fusions are displayed.

By clicking the "open" icon, you can see the detailed fusion results at transcript level.

This table is sorted by fusion protein (many transcript combinations produce the same protein product). Click the open icon again to see the fusion plot:

The plot shows both the fusion DNA and the fusion proteins. The domains are predicted using Pfam. The bottom of the plot shows the predicted cDNA and protein sequences.
