A splicing branchpoint reporter pipeline.
This pipeline has the following dependencies:
- python3 (tested with v3.10.6)
- bowtie2 (tested with v2.4.5)
- samtools (tested with v1.15.1)
- bedtools (tested with v2.30.0)
- numpy (tested with v1.23.2)
- pyfaidx (tested with v0.7.2.1)
- intervaltree (tested with v3.1.0)
These dependencies are included in the file environment.yaml
which can be used to make a conda environment for the pipeline by running conda env create -f environment.yaml
. Then, activate the environment with conda activate larmap_env
For M1 mac users: please install packages bowtie2
, bedtools
, and samtools
using the command arch -arm64 brew install [package]
before running conda
, if any of the above pacakges has not previously been installed.
Prior to running the pipeline, please use python get_splice_site_seqs.py [introns].bed [genome].fa [output_name_prefix]
to produce the reference file of sequences in 5nt window upstream of 5' splice sites; the output is to be used as the input of the -u
, or --ref_5p_upstream
option in the pipeline.
To run the larmap pipeline, use ./larmap_run.sh
with the following arguments:
-d, --fastq_dir Directory containing the FASTQ files
-1, --read_1_file Read one (R1) FASTQ file
-2, --read_2_file Read two (R2) FASTQ file
-o, --output_dir Directory for output files
-e, --output_base_name Prefix to add to output files
-c, --num_cpus Number of CPUs available
-i, --ref_b2index Bowtie2 index of the full reference genome
-f, --ref_fasta FASTA file of the full reference genome
-g, --ref_gtf GTF file with gene annotation of the reference genome
-5, --ref_5p_fasta FASTA file with sequences of first 20nt from reference 5' splice sites (first 20nt of introns)
-u, --ref_5p_upstream Custom file of sequences in 5nt window upstream of 5' splice sites
-3, --ref_3p_b2index Bowtie2 index file of last 250nt from reference 3' splice sites (last 250nt of introns)
-l, --ref_3p_lengths Custom file with the lengths of the sequences in ref_3p_b2index (some introns are <250nt)
-n, --ref_introns BED file of all introns in the reference genome
-m, --ref_repeatmasker BED file of repetitive elements from repeatmasker
A directory named [output_base_name]_lariat_mapping
will be created in output_dir
. Upon completion of the pipeline, this directory will contain a tab-separated results file with lariat read info called [output_base_name]_lariat_reads.txt
.
-
larmap_run.sh
callsmap_lariats.sh
sequentially for the read one (R1) and read two (R2) FASTQ files. This will produce three files in the output subdirectory for the read file: -[output_base_name]*[R1/R2]_total_reads.txt
(one line file containing count of linearly-aligned reads from read file)[output_base_name]_[R1/R2]_fivep_info_table.txt
(intermediate file containing info on the mapping of the 5'SS sequences to the unmapped reads)[output_base_name]_[R1/R2]\_final_info_table.txt
(results file containing candidate lariat reads obtained after mapping the 5'SS trimmed reads to the 3'SS region sequences)
The mapping script
map_lariats.sh
will:- Align reads to the reference genome with bowtie2; save mapped read count and proceed with unmapped reads
- Convert the unmapped reads bam file to FASTA format with samtools
- Build a bowtie2 index of the unmapped reads FASTA file
- Align a FASTA file of 5'SS to the unmapped reads index
- Trim reads with 5'SS alignments and write trimmed reads to FASTA file
- Align the trimmed reads to a Bowtie2 index of 3'SS regions
- Take the mapped trimmed reads from and create an output file containing candidate lariat reads
-
The
merge_filter_lariats.py
script loads intron and gene information from provided annotation files, combines the mapping results from each sample's read one and read two files, and performs post-mapping filtering before outputting the final lariat mapping results.The candidate lariat reads from each sample are filtered based on the following criteria: - BP is within 2bp of a splice site (likely from an intron circle, not a lariat) - 5'SS and 3'SS are not in the correct order - Read maps to a Ubiquitin gene (likely false positive due to repetitive nature of gene) - There is a valid aligment for the 3' segment upstream of the 5' segment - Both the 5'SS and the BP overlap with repetitive regions from RepeatMasker (likely false positive) - NEEDS TO BE IMPLEMENTED: Filter out template switching reads (eg. 5'SS 6bp sequence matches 6bp sequence downstream of BP) - NEEDS TO BE IMPLEMENTED: Correct BP position to account for RT skipping
After filtering, the final set of lariat reads for all the samples are merged (so that lariats discovered in both mates are only reported once) and output to the
results_path
file. The initial columns of this tab-delimited file are any additional custom columns which were added to the info file. After these columns, the file contains the following data for each lariat read: - Gene name - Gene Ensembl ID - Gene type (e.g. protein-coding, lncRNA) - Read ID - Read sequence - Chromosome - Strand - 5' splice site coordinate - 3' splice site coordinate - Branchpoint coordinate - Read branchpoint nucleotide - Genomic branchpoint nucleotide - Genomic branchpoint sequence context (10bp window with the BP at position 5) - Branchpoint position relative to nearest 3' splice site - Total count of linearly-aligned reads for the sample the lariat read is from (useful for normalization when comparing lariat levels between samples)