simDEF: Definition-based Semantic Similarity Measure of GO Terms for Functional Similarity Analysis of Genes
The rapid growth of biomedical data annotated with the Gene Ontology (GO) vocabulary demands an intelligent method for measuring semantic similarity between GO terms, which in turn facilitates the analysis of functional similarity between genes. Compared with sequence and structure similarity, functional similarity is more informative for understanding the biological roles and functions of genes. Many important applications in computational molecular biology, such as gene clustering, protein function prediction, protein interaction evaluation and disease gene prioritization, require functional similarity. Some existing semantic similarity measures combine the similarity scores of single GO term pairs to estimate gene functional similarity, whereas others compare terms in groups. Nevertheless, all of these measures depend strictly on the ever-changing topological structure of GO, they are highly task dependent and therefore hard to generalize, and none of them takes the valuable textual definitions of GO terms into consideration. These limitations make it challenging to measure gene functional similarity reliably.
This project introduces simDEF, an efficient method for measuring the semantic similarity of GO terms using their GO definitions. In essence, simDEF is an optimized version of the Gloss Vector measure, which is commonly used in natural language processing (NLP). Pointwise mutual information (PMI) is employed for this optimization. After the optimized definition-vectors of all GO terms are constructed, the cosine of the angle between two terms' definition-vectors represents the degree of similarity between them. Experimental studies show that simDEF outperforms existing semantic similarity measures in terms of correlation with sequence homology and gene expression data, and that it is better at separating true from false interactions in a protein-protein interaction (PPI) prediction task. Relative to existing similarity measures, when validated on a yeast (Saccharomyces cerevisiae) reference database, simDEF improves correlation with sequence homology by up to 50%, improves correlation with gene expression by more than 4% in the biological process hierarchy of GO, and increases PPI predictability by more than 2.5% in F1-score for the molecular function hierarchy.
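The general idea (definition-vectors weighted by PMI and compared by cosine) can be illustrated with a minimal sketch. This is not the released simDEF code. The three definitions are invented toy text rather than real GO definitions, the term names and stopword list are hypothetical, co-occurrence is counted inside the toy definitions themselves, and negative PMI values are clipped to zero (positive PMI). All of these are assumptions made only to keep the example short.

```python
import math
from collections import Counter
from itertools import combinations

# Toy definitions: hypothetical text, NOT actual GO definitions.
DEFINITIONS = {
    "term_a": "catalysis of the transfer of a phosphate group to a protein",
    "term_b": "removal of a phosphate group from a protein substrate",
    "term_c": "binding of a lipid to a membrane receptor",
}
STOPWORDS = {"of", "the", "a", "to", "from"}  # assumed, minimal list


def tokenize(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def cooccurrence(token_lists):
    """Count how often two distinct words share a definition (symmetric)."""
    counts = Counter()
    for tokens in token_lists:
        for w1, w2 in combinations(sorted(set(tokens)), 2):
            counts[(w1, w2)] += 1
            counts[(w2, w1)] += 1
    return counts


def ppmi_word_vectors(counts):
    """Turn raw co-occurrence counts into positive-PMI word vectors."""
    total = sum(counts.values())
    marginal = Counter()
    for (w1, _), c in counts.items():
        marginal[w1] += c  # row sums; column sums are equal by symmetry
    vectors = {}
    for (w1, w2), c in counts.items():
        pmi = math.log2(c * total / (marginal[w1] * marginal[w2]))
        if pmi > 0:  # assumption: keep only positive associations
            vectors.setdefault(w1, {})[w2] = pmi
    return vectors


def definition_vector(tokens, word_vectors):
    """Sum the word vectors of every token in a definition."""
    vec = Counter()
    for w in tokens:
        for context, weight in word_vectors.get(w, {}).items():
            vec[context] += weight
    return vec


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


tokens = {term: tokenize(text) for term, text in DEFINITIONS.items()}
word_vectors = ppmi_word_vectors(cooccurrence(tokens.values()))
vectors = {t: definition_vector(toks, word_vectors) for t, toks in tokens.items()}

for t1, t2 in combinations(sorted(vectors), 2):
    print(t1, t2, round(cosine(vectors[t1], vectors[t2]), 3))
```

In this toy setting the two phosphate-related definitions score higher than either does against the lipid-binding definition, because their words share co-occurrence contexts. A real application would need a much larger text collection for the co-occurrence statistics.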
This code is free and can be used, modified and redistributed without any restrictions.
Release date: September 2015
Documentation: Please refer to the provided instruction file before use (highly recommended).
The datasets built in the study and used in the evaluation analyses are listed below (see the 'EXPERIMENTAL DATA' section, 'Validation datasets' subsection, for details):
- Sequence Homology Data (20,167 protein pairs)
- Gene Expression Data (4,800 protein pairs)
- PPI Data (6,000 protein pairs)
simDEF: Definition-based Semantic Similarity Measure of Gene Ontology Terms for Functional Similarity Analysis of Genes
Ahmad Pesaranghader; Stan Matwin; Marina Sokolova; Robert G. Beiko
Bioinformatics 2015;
doi: 10.1093/bioinformatics/btv755
(supplementary material file)
Ahmad Pesaranghader © 2015