An R package for finding orthologous variants between species

mustafapir/orthoVar

orthoVar Package

The orthoVar package provides functions for generating genome-wide multiple sequence alignments (MSA) and finding orthologous variants between species. The package has two main functions. orthoMSA generates MSA tables formatted as input for the orthoFind function, which finds the orthologous variants.

The orthoMSA function takes the following arguments:

  • species1: Scientific name of the reference species whose sequence data serves as the base of the alignment. Default is Homo sapiens. Other species will be supported in future releases.

  • species: A character string or character vector specifying the scientific name(s) of the species whose protein sequences will be aligned. Valid inputs can be listed with the listSpecies() command.

  • humanSeqFile: Path to a FASTA file of human protein sequences. Default is NA, which downloads the file from NCBI.

  • seqFiles: A character string or character vector specifying the path(s) to FASTA files of protein sequences for the other species given in the species argument. Default is NA, which downloads the files from NCBI.

  • annot: Annotation source. Either ncbi or ensembl.

  • customOrt: Gene orthology data for the given species. Default is NA, which takes the data from the Alliance of Genome Resources (alliancegenome.org). It can also be ensembl or a custom data frame, which must be in the same format as the example below, including column names:

| Gene1Symbol | Gene1SpeciesName | Gene2Symbol | Gene2SpeciesName |
|---|---|---|---|
| PSMB6 | Homo sapiens | PRE3 | Saccharomyces cerevisiae |
| RPN1 | Homo sapiens | OST1 | Saccharomyces cerevisiae |
| COX16 | Homo sapiens | COX16 | Saccharomyces cerevisiae |
| SYS1 | Homo sapiens | SYS1 | Saccharomyces cerevisiae |
| PHLPP2 | Homo sapiens | CYR1 | Saccharomyces cerevisiae |
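A custom orthology table in this format can be built in base R. The following is a minimal sketch that reuses the example rows above. The variable names custom_ort and required are illustrative, and the final check is only a suggestion, not something the package is known to require.

```r
# Custom orthology table using the column names from the example above.
custom_ort <- data.frame(
  Gene1Symbol      = c("PSMB6", "RPN1", "COX16", "SYS1", "PHLPP2"),
  Gene1SpeciesName = "Homo sapiens",
  Gene2Symbol      = c("PRE3", "OST1", "COX16", "SYS1", "CYR1"),
  Gene2SpeciesName = "Saccharomyces cerevisiae",
  stringsAsFactors = FALSE
)

# Confirm the column names and their order match the documented format.
required <- c("Gene1Symbol", "Gene1SpeciesName",
              "Gene2Symbol", "Gene2SpeciesName")
stopifnot(identical(names(custom_ort), required))
```

The resulting data frame would then be supplied as the customOrt argument of orthoMSA.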

The output is a data frame whose first two columns hold the protein ID and sequence for Homo sapiens. The remaining columns follow the same pattern, with each consecutive pair of columns belonging to one species.

Example usage: hum_mouse <- orthoMSA(species1 = "Homo sapiens", species = "Mus musculus", customOrt = "ensembl", annot = "ensembl")
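Because the output stores each species as a pair of adjacent columns (ID, then sequence), it can be split into one small table per species by position. The sketch below uses a mock table. Its column names and sequences are hypothetical placeholders, not actual orthoMSA output.

```r
# Mock table in the documented layout: one (id, sequence) column pair per species.
# Column names and sequences are hypothetical placeholders.
msa_mock <- data.frame(
  human_id  = c("P1", "P2"),
  human_seq = c("MKT-A", "MLS--"),
  mouse_id  = c("Q1", "Q2"),
  mouse_seq = c("MKTTA", "MLSGG"),
  stringsAsFactors = FALSE
)

# Group column indices two at a time: columns 1-2 are pair 1, columns 3-4 pair 2.
pair_index <- ceiling(seq_along(msa_mock) / 2)

# Build one two-column data frame per species.
per_species <- lapply(
  split(seq_along(msa_mock), pair_index),
  function(cols) msa_mock[, cols, drop = FALSE]
)

names(per_species[[2]])
```

Splitting by position rather than by column name avoids relying on a naming scheme that this README does not specify.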

orthoFind function to find orthologous variants

The orthoFind function finds orthologous variants between species. It takes the MSA table generated by the orthoMSA function as input. The other required inputs are:

  • Variant data (df1, df2) as a data.table or data.frame, with at least the following columns:

    • Refseq_ID: NCBI reference sequence ID of the protein.

    • aapos: Amino acid position of variants.

    • from: Reference amino acid.

    • to: Variant (substituted) amino acid.

| Id | aapos | Allele_frequency | from | to | Refseq_ID | Gene_name | type | Phenotype | Source |
|---|---|---|---|---|---|---|---|---|---|
| rs1185396016 | 240 | 5.50e-06 | N | S | NP_003261 | TSPAN6 | unknown | Unknown | gnomAD |
| rs138104330 | 239 | 5.50e-06 | N | S | NP_003261 | TSPAN6 | unknown | Unknown | gnomAD |
| rs778356735 | 237 | 5.50e-06 | I | T | NP_003261 | TSPAN6 | unknown | Unknown | gnomAD |
| rs745504645 | 235 | 5.50e-06 | R | L | NP_003261 | TSPAN6 | unknown | Unknown | gnomAD |
| rs745504645 | 235 | 1.47e-05 | R | H | NP_003261 | TSPAN6 | unknown | Unknown | gnomAD |
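Before passing variant data to orthoFind, it may help to confirm that the four required columns are present. The sketch below builds a small table from the first three example rows and checks it. The helper missing_variant_cols is hypothetical and not part of orthoVar.

```r
# Variant table with the required columns, taken from the example rows above.
variants <- data.frame(
  Id        = c("rs1185396016", "rs138104330", "rs778356735"),
  aapos     = c(240L, 239L, 237L),
  from      = c("N", "N", "I"),
  to        = c("S", "S", "T"),
  Refseq_ID = "NP_003261",
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of orthoVar): returns the names of the
# required columns that are absent from df.
missing_variant_cols <- function(df) {
  required <- c("Refseq_ID", "aapos", "from", "to")
  setdiff(required, names(df))
}

missing_variant_cols(variants)                        # character(0)
missing_variant_cols(variants[, c("Id", "aapos")])    # "Refseq_ID" "from" "to"
```

An empty result means the table has every column listed as required above. Any extra columns, such as Allele_frequency or Source, are left untouched by the check.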

All arguments are listed below:

  • df1: Variant data for the first organism.

  • df2: Variant data for the second organism.

  • org1: Scientific name of the first organism.

  • org2: Scientific name of the second organism.

  • msa: MSA table produced by the orthoMSA function.

  • ort: Whether to filter the data by variant type (conserved or non-conserved). Default is TRUE. See the paper for a detailed explanation: <https://doi.org/10.1101/2021.01.07.424951>

The output is a data.table in which each row represents one pairing of a variant with its orthologous variant. Below is an example output:

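| C_elegans_ID | C_elegans_aapos | C_elegans_from | C_elegans_to | Human_ID | Human_aapos | Human_from | Human_to | msa_id |
|---|---|---|---|---|---|---|---|---|
| NP_510365 | 80 | P | S | NP_000545 | 64 | P | A | 20 |
| NP_510365 | 80 | P | S | NP_000545 | 64 | P | T | 20 |
| NP_510365 | 256 | P | T | NP_000545 | 272 | P | T | 20 |
| NP_510365 | 256 | P | T | NP_000545 | 272 | P | S | 20 |
| NP_510365 | 80 | P | S | NP_000545 | 64 | P | A | 21 |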
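A result table like this can be post-processed with base R. The sketch below recreates the example rows, collapses pairs that appear under more than one msa_id, and keeps the pairs in which both species show the same amino acid change. It assumes that msa_id only identifies the alignment a pair came from, which this README does not state explicitly.

```r
# Example output rows, recreated as a plain data frame.
res <- data.frame(
  C_elegans_ID    = "NP_510365",
  C_elegans_aapos = c(80L, 80L, 256L, 256L, 80L),
  C_elegans_from  = "P",
  C_elegans_to    = c("S", "S", "T", "T", "S"),
  Human_ID        = "NP_000545",
  Human_aapos     = c(64L, 64L, 272L, 272L, 64L),
  Human_from      = "P",
  Human_to        = c("A", "T", "T", "S", "A"),
  msa_id          = c(20L, 20L, 20L, 20L, 21L),
  stringsAsFactors = FALSE
)

# Drop msa_id, then remove pairs that were reported more than once.
pairs <- unique(res[, setdiff(names(res), "msa_id")])

# Keep pairs where the reference and substituted residues match in both species.
same_change <- pairs[pairs$C_elegans_from == pairs$Human_from &
                     pairs$C_elegans_to == pairs$Human_to, ]

nrow(pairs)        # 4 distinct variant pairs
nrow(same_change)  # 1 pair with an identical P-to-T change
```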
