PATT: Proteome Annotation Transfer Tool

[Figure: PATT pipeline]

Proteome Annotation Transfer Tool (PATT) transfers annotations from an annotated reference to an unannotated query genome. It is built on the Snakemake workflow management system, so its steps run in a highly parallelized fashion, which makes it practical for annotating large genomic data sets quickly. For each protein of a closely related reference, PATT searches the genome to be annotated for the best ortholog, builds the best gene model for it, and returns its coding and peptide sequences together with its coordinates in .gff and .gbk annotation files.

Dependencies:

Snakemake (https://snakemake.readthedocs.io/en/stable/index.html)

Exonerate (https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate) (v.2.4.0)

Blat (https://github.com/djhshih/blat)

Perl (https://www.perl.org/get.html) (v5.30.0)

AWK

Java

Parallel (https://manpages.ubuntu.com/manpages/impish/man1/parallel.1.html)

Perl Modules

Getopt::Long

Getopt::Std

Parallel::ForkManager
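
Before running the pipeline, it can help to confirm that the command-line dependencies listed above are reachable. The following is a minimal sketch, not part of PATT: the executable names in `PATT_TOOLS` are assumptions (for example, that Blat is installed as `blat` and GNU Parallel as `parallel`), and the Perl modules are not checked.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
PATT_TOOLS = ["snakemake", "exonerate", "blat", "perl", "awk", "java", "parallel"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(PATT_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed executables were found on PATH.")
```

An empty result only means the executables were found; it does not verify their versions.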

Installation:

Option 2

Make sure all dependencies are installed. You also need to download the scripts in the "bin" folder and add them to your PATH.

To avoid Java errors, set the CLASSPATH environment variable to the absolute path of "readseq.jar", which is in the bin folder:

export CLASSPATH="/full/path/to/bin/readseq.jar"

See the Snakemake site for more details.
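
To verify the CLASSPATH setting described above, a small check like the following may be useful. It is a sketch under the assumption that CLASSPATH should contain an entry whose file name is exactly `readseq.jar`; the helper name `readseq_on_classpath` is hypothetical and not part of PATT.

```python
import os


def readseq_on_classpath(classpath):
    """Return True if some CLASSPATH entry is an existing file named readseq.jar."""
    # CLASSPATH may hold several entries separated by os.pathsep (":" on Linux).
    entries = [entry for entry in classpath.split(os.pathsep) if entry]
    return any(
        os.path.basename(entry) == "readseq.jar" and os.path.isfile(entry)
        for entry in entries
    )


if __name__ == "__main__":
    if readseq_on_classpath(os.environ.get("CLASSPATH", "")):
        print("CLASSPATH points to readseq.jar")
    else:
        print("CLASSPATH does not point to an existing readseq.jar")
```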

Quick usage: (Install Option 2)

If your input files are named genome.fasta and protein.faa, run:

snakemake --cores -s /path/of/Snakefile

If the genome or protein FASTA files have other names, run:

snakemake --cores <core_numbers> --config PROTREF="current_protein_fasta_filename" GENOME="current_genome_fasta_filename" -s path/of/Snakefile_PATT

More options

snakemake --cores <core_numbers> --rerun-incomplete --config PROTREF="protein.faa" GENOME="genome.fasta" PREFIX="prefix_outputfilename" NEWPREFIX="prefix_newgenenames_" -s path/of/Snakefile_PATT
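
When PATT is launched from a script, the commands above can be assembled programmatically. The sketch below only uses the flags that appear in this README (`--cores`, `--rerun-incomplete`, `--config`, `-s`); the function names `build_patt_command` and `run_patt` are hypothetical helpers, not part of PATT or Snakemake.

```python
import subprocess


def build_patt_command(snakefile, cores, config=None, rerun_incomplete=False):
    """Assemble the snakemake argument list for a PATT run."""
    cmd = ["snakemake", "--cores", str(cores)]
    if rerun_incomplete:
        cmd.append("--rerun-incomplete")
    if config:
        # Each entry becomes KEY=VALUE after --config, as in the commands above.
        cmd.append("--config")
        cmd.extend(f"{key}={value}" for key, value in config.items())
    cmd.extend(["-s", snakefile])
    return cmd


def run_patt(snakefile, cores, config=None, rerun_incomplete=False):
    """Run snakemake and raise CalledProcessError on a non-zero exit status."""
    cmd = build_patt_command(snakefile, cores, config, rerun_incomplete)
    subprocess.run(cmd, check=True)
```

For example, `build_patt_command("path/of/Snakefile_PATT", 8, {"PROTREF": "protein.faa", "GENOME": "genome.fasta"})` yields the argument list for the second command shown above with 8 cores. Passing a list to `subprocess.run` avoids shell quoting problems with file names.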

Optional variables that PATT accepts:

GENOME= "genome.fasta" # FASTA file of the genome to annotate. Default: "genome.fasta"

PROTREF= "protein.faa" # FASTA file of the reference proteins to transfer to (annotate in) the genome. Default: "protein.faa"

PREFIX= "prefix" # Output file prefix. Default: "mySpecies"

NEWPREFIX= "prefix_" # Prefix for the names of the proteins/transcripts in the annotated genome. We suggest ending it in "_" for aesthetics. Default: "{PREFIX}"

OLDPREFIX= "prefix" # Prefix of the protein identifiers in the .faa file to be transferred. Perl regular expressions are accepted, e.g. "^\S+gene[^_|\s]+". PATT generates new names for the transferred proteins, keeping all (default) or part of the original protein identifier. Default: "=gene". Example: if the proteins to be transferred are named "tsol_\d+", you can set OLDPREFIX= "=genetsol_" and the new names will be "mySpecies_\d+"
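
The effect of swapping an old prefix for a new one can be illustrated with a short sketch. This is not PATT's implementation: it uses Python's `re` module rather than Perl, it does not model the "=gene" form of OLDPREFIX, and the function name `rename_protein` is hypothetical. It only shows the general idea of replacing a regex-matched prefix while keeping the rest of the identifier.

```python
import re


def rename_protein(name, old_prefix_regex, new_prefix):
    """Replace the first match of old_prefix_regex in name with new_prefix."""
    # A function replacement keeps new_prefix literal (no backslash escapes).
    return re.sub(old_prefix_regex, lambda match: new_prefix, name, count=1)


if __name__ == "__main__":
    # The numeric part of the identifier is preserved.
    print(rename_protein("tsol_000123", r"^tsol_", "mySpecies_"))
```

Identifiers that do not match the pattern are returned unchanged, which makes it easy to spot names the pattern failed to cover.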

Output files

PATT produces 4 output files:

File ".gff"

Annotation file in GFF format of the transferred proteins.

File ".gbk"

Annotation file in GenBank format of the transferred proteins.

File ".ffn"

FASTA file of all coding sequences (CDS).

File ".faa"

FASTA file of the peptide sequences.
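
After a run, a quick check that all four files exist can catch an incomplete execution. The sketch below assumes the output files are named by appending each extension to the PREFIX value (for example "mySpecies.gff") and are written to a known directory; both are assumptions, and the helper names are hypothetical.

```python
from pathlib import Path

# The four extensions listed above.
OUTPUT_EXTENSIONS = (".gff", ".gbk", ".ffn", ".faa")


def expected_outputs(prefix, directory="."):
    """Return the paths expected for a given prefix (naming is an assumption)."""
    return [Path(directory) / f"{prefix}{ext}" for ext in OUTPUT_EXTENSIONS]


def missing_outputs(prefix, directory="."):
    """Return the expected output paths that do not exist as files."""
    return [path for path in expected_outputs(prefix, directory) if not path.is_file()]


if __name__ == "__main__":
    for path in missing_outputs("mySpecies"):
        print("Missing:", path)
```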

Citation

Estrada, K. (2023). PATT (Proteome Annotation Transfer Tool) (Version 1) [Computer software]. https://doi.org/10.5281/zenodo.7958134

Acknowledgments

PATT wouldn't be the same without my fellow researchers at the UUSMB (Unidad Universitaria de Secuenciación Masiva y Bioinformática), Jerome Verleyen and Alejandro Sanchez, who helped me with ideas and challenges during PATT's development.

PATT uses Snakemake for pipeline development, Exonerate to perform alignments, Readseq for handling file formats, Mario Stanke's script "gff2gbSmallDNA.pl", and many lines of code and scripts from my dear friend and god-level programmer, Alejandro Garciarrubio. I am grateful for his help and guidance.

Author

Karel Estrada

karel.estrada@ibt.unam.mx

Twitter: @kjestradag
