The `orthoVar` package provides functions for generating genome-wide
multiple sequence alignments (MSA) and finding orthologous variants
between species. There are two functions in this package. `orthoMSA`
generates MSA tables formatted for use as input to the `orthoFind`
function.
The `orthoMSA` function takes the following arguments:

- `species1`: Scientific name of the species whose sequence data serves as the base for the alignment. Default is `Homo sapiens`. Other species will be supported in future releases.
- `species`: A character string or character vector specifying the scientific names of the species whose protein sequences will be aligned. Valid inputs can be listed with the `listSpecies()` command.
- `humanSeqFile`: Path to a FASTA file of human protein sequences. Default is `NA`, which downloads the file from NCBI.
- `seqFiles`: A character string or character vector specifying the paths to FASTA files of protein sequences for the other species given in the `species` argument. Default is `NA`, which downloads the files from NCBI.
- `annot`: Annotation source. Either `ncbi` or `ensembl`.
- `customOrt`: A data frame of gene orthology data for the given species. Default is `NA`, which takes data from AllianceGenome: Alliance of Genome Resources (alliancegenome.org). Alternatively, this can be `ensembl` or custom data, which should be in the same format as the data below, including column names:
Gene1Symbol | Gene1SpeciesName | Gene2Symbol | Gene2SpeciesName |
---|---|---|---|
PSMB6 | Homo sapiens | PRE3 | Saccharomyces cerevisiae |
RPN1 | Homo sapiens | OST1 | Saccharomyces cerevisiae |
COX16 | Homo sapiens | COX16 | Saccharomyces cerevisiae |
SYS1 | Homo sapiens | SYS1 | Saccharomyces cerevisiae |
PHLPP2 | Homo sapiens | CYR1 | Saccharomyces cerevisiae |
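As a minimal sketch, a custom orthology table in the format above could be built in base R as follows. The object names `custom_ort` and `expected_cols` are illustrative and not part of the package; the rows are copied from the table above.

```r
# Build a custom orthology table with the exact column names shown above.
# Length-1 values (the species names) are recycled across all rows.
custom_ort <- data.frame(
  Gene1Symbol      = c("PSMB6", "RPN1", "COX16"),
  Gene1SpeciesName = "Homo sapiens",
  Gene2Symbol      = c("PRE3", "OST1", "COX16"),
  Gene2SpeciesName = "Saccharomyces cerevisiae",
  stringsAsFactors = FALSE
)

# Sanity check before passing the table on: names and order must match.
expected_cols <- c("Gene1Symbol", "Gene1SpeciesName",
                   "Gene2Symbol", "Gene2SpeciesName")
stopifnot(identical(colnames(custom_ort), expected_cols))
```

A table like this would then presumably be supplied through the `customOrt` argument of `orthoMSA`.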
The output is a data frame whose first two columns hold the protein
ID and sequence for Homo sapiens. The remaining columns follow the same
pattern, with each pair of columns belonging to one species.
Example usage:
hum_mouse <- orthoMSA(species1 = "Homo sapiens", species = "Mus musculus", customOrt = "ensembl", annot = "ensembl")
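The two-columns-per-species layout of the output can be walked through by index. The sketch below uses a toy table: the column names (`human_id`, `human_seq`, and so on) and the sequences are made up for illustration, and the real column names produced by `orthoMSA` may differ.

```r
# Toy stand-in for an orthoMSA result: an (id, sequence) column pair per species.
# All names and sequences here are hypothetical placeholders.
msa_toy <- data.frame(
  human_id  = c("P1", "P2"),
  human_seq = c("MKT-A", "MS--L"),
  mouse_id  = c("Q1", "Q2"),
  mouse_seq = c("MKTQA", "MSRRL"),
  stringsAsFactors = FALSE
)

# Columns 1-2 belong to the first species, 3-4 to the second, and so on.
pair_starts <- seq(1, ncol(msa_toy), by = 2)
per_species <- lapply(pair_starts, function(i) msa_toy[, c(i, i + 1)])
```

Each element of `per_species` is then a two-column data frame for one species.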
The `orthoFind` function finds orthologous variants between
species. The MSA table generated by the `orthoMSA` function is used as
input. Other required inputs are:
- Variant data (`df1`, `df2`) as a `data.table` or `data.frame`, with at least the following columns:
  - `Refseq_ID`: NCBI reference sequence ID of the protein.
  - `aapos`: Amino acid position of the variant.
  - `from`: Reference amino acid.
  - `to`: Converted amino acid.
Id | aapos | Allele_frequency | from | to | Refseq_ID | Gene_name | type | Phenotype | Source |
---|---|---|---|---|---|---|---|---|---|
rs1185396016 | 240 | 5.50e-06 | N | S | NP_003261 | TSPAN6 | unknown | Unknown | gnomAD |
rs138104330 | 239 | 5.50e-06 | N | S | NP_003261 | TSPAN6 | unknown | Unknown | gnomAD |
rs778356735 | 237 | 5.50e-06 | I | T | NP_003261 | TSPAN6 | unknown | Unknown | gnomAD |
rs745504645 | 235 | 5.50e-06 | R | L | NP_003261 | TSPAN6 | unknown | Unknown | gnomAD |
rs745504645 | 235 | 1.47e-05 | R | H | NP_003261 | TSPAN6 | unknown | Unknown | gnomAD |
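A variant table with the minimum required columns might be prepared and checked as sketched below. The helper `has_required_columns` is hypothetical and not part of the package; the rows are taken from the table above.

```r
# Minimal variant table: only the four columns orthoFind requires.
human_variants <- data.frame(
  Refseq_ID = c("NP_003261", "NP_003261", "NP_003261"),
  aapos     = c(240L, 239L, 237L),
  from      = c("N", "N", "I"),
  to        = c("S", "S", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if every required column is present.
has_required_columns <- function(df) {
  required <- c("Refseq_ID", "aapos", "from", "to")
  all(required %in% colnames(df))
}

stopifnot(has_required_columns(human_variants))
```

Extra columns such as `Id` or `Allele_frequency` can be kept alongside these, since the requirement is stated as "at least" the four above.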
All arguments are listed below:
- `df1`: Variant data for the first organism.
- `df2`: Variant data for the second organism.
- `org1`: Scientific name of the first organism.
- `org2`: Scientific name of the second organism.
- `msa`: MSA table produced by the `orthoMSA` function.
- `ort`: Whether the data should be filtered according to variant type (conserved and non-conserved). Default is `TRUE`. Refer to the paper for a detailed explanation: <https://doi.org/10.1101/2021.01.07.424951>
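Putting the arguments together, a call could look like the sketch below. The variant tables are single-row placeholders (the second RefSeq ID is made up), `find_args` and `ortho_variants` are illustrative names, and `hum_mouse` is assumed to be the result of an earlier `orthoMSA` call. The call only runs when the package is installed and that object exists.

```r
# Arguments named exactly as in the list above; the values are placeholders.
find_args <- list(
  df1  = data.frame(Refseq_ID = "NP_003261", aapos = 240L, from = "N", to = "S"),
  df2  = data.frame(Refseq_ID = "NP_000000", aapos = 100L, from = "N", to = "S"), # made-up ID
  org1 = "Homo sapiens",
  org2 = "Mus musculus",
  msa  = NULL,  # filled in below with the orthoMSA result
  ort  = TRUE
)

# Guarded call: skipped unless orthoVar is installed and hum_mouse is available.
if (requireNamespace("orthoVar", quietly = TRUE) && exists("hum_mouse")) {
  find_args$msa <- hum_mouse
  ortho_variants <- do.call(orthoVar::orthoFind, find_args)
}
```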
The output is a `data.table` in which each row represents one combination
of a variant and its orthologous variant. Below is an example output:
C_elegans_ID | C_elegans_aapos | C_elegans_from | C_elegans_to | Human_ID | Human_aapos | Human_from | Human_to | msa_id |
---|---|---|---|---|---|---|---|---|
NP_510365 | 80 | P | S | NP_000545 | 64 | P | A | 20 |
NP_510365 | 80 | P | S | NP_000545 | 64 | P | T | 20 |
NP_510365 | 256 | P | T | NP_000545 | 272 | P | T | 20 |
NP_510365 | 256 | P | T | NP_000545 | 272 | P | S | 20 |
NP_510365 | 80 | P | S | NP_000545 | 64 | P | A | 21 |
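As a sketch of how such a result might be explored, the rows above are reproduced below as a plain data frame and filtered in base R. The object names are illustrative; a real result would be a `data.table`, which also accepts this kind of subsetting.

```r
# The example output above, rebuilt as a plain data frame.
ortho_variants <- data.frame(
  C_elegans_ID    = "NP_510365",
  C_elegans_aapos = c(80L, 80L, 256L, 256L, 80L),
  C_elegans_from  = "P",
  C_elegans_to    = c("S", "S", "T", "T", "S"),
  Human_ID        = "NP_000545",
  Human_aapos     = c(64L, 64L, 272L, 272L, 64L),
  Human_from      = "P",
  Human_to        = c("A", "T", "T", "S", "A"),
  msa_id          = c(20L, 20L, 20L, 20L, 21L),
  stringsAsFactors = FALSE
)

# Keep pairs where both species carry the same amino acid change.
same_change <- ortho_variants[
  ortho_variants$C_elegans_from == ortho_variants$Human_from &
    ortho_variants$C_elegans_to == ortho_variants$Human_to, ]

# Collapse pairs that appear under more than one msa_id.
distinct_pairs <- unique(
  ortho_variants[, setdiff(names(ortho_variants), "msa_id")]
)
```

Here `same_change` keeps only the P to T pair at positions 256 and 272, and `distinct_pairs` drops the row that repeats under `msa_id` 21.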