simDEF is an NLP-based model for gene function analysis using Gene Ontology annotations of gene products and proteins.

ahmadpgh/simDEF

simDEF: Definition-based Semantic Similarity Measure of GO Terms for Functional Similarity Analysis of Genes

Background

The rapid growth of biomedical data annotated with the Gene Ontology (GO) vocabulary demands an intelligent method for measuring semantic similarity between GO terms, which in turn supports the analysis of functional similarity between genes. Compared with sequence and structure similarity, functional similarity is more informative for understanding the biological roles and functions of genes. Many important applications in computational molecular biology, such as gene clustering, protein function prediction, protein interaction evaluation, and disease gene prioritization, require functional similarity. Some existing semantic similarity measures combine the similarity scores of individual GO term pairs to estimate gene functional similarity, whereas others compare terms in groups. Nevertheless, all of these measures depend strictly on the ever-changing topological structure of GO, they are highly task dependent and therefore hard to generalize, and none of them takes the valuable textual definitions of GO terms into account. These limitations make it challenging to measure gene functional similarity reliably.
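As an illustration of the pairwise approach mentioned above, the following minimal sketch combines term-level scores into a gene-level score with a best-match average, one commonly used combination strategy. It is not taken from the simDEF code: the function name `best_match_average`, the placeholder term IDs `t1` to `t3`, and the scores in `toy_scores` are all hypothetical and chosen only for illustration.

```python
def best_match_average(terms_a, terms_b, term_sim):
    """Combine term-pair similarities into one gene-level score.

    For every term of one gene, take its best-matching term in the other
    gene, average those best scores per gene, then average both directions.
    """
    if not terms_a or not terms_b:
        return 0.0
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2


# Hypothetical term-pair scores; any term-level measure could be plugged in.
toy_scores = {
    frozenset({"t1", "t2"}): 0.8,
    frozenset({"t1", "t3"}): 0.2,
    frozenset({"t2", "t3"}): 0.4,
}


def toy_term_sim(a, b):
    """Symmetric lookup; a term is maximally similar to itself."""
    return 1.0 if a == b else toy_scores[frozenset({a, b})]


gene_x = ["t1", "t2"]  # placeholder annotation sets
gene_y = ["t2", "t3"]
print(best_match_average(gene_x, gene_y, toy_term_sim))
```

Because `term_sim` is passed in as a function, the same aggregation can be reused with any term-level similarity measure.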

Results and conclusions

This project introduces simDEF, an efficient method for measuring the semantic similarity of GO terms using their GO definitions. In essence, simDEF is an optimized version of the Gloss Vector measure, which is commonly used in natural language processing (NLP). Pointwise mutual information (PMI) is employed for this optimization. After the optimized definition vectors of all GO terms are constructed, the cosine of the angle between two terms' definition vectors represents the degree of similarity between them. Experimental studies show that simDEF outperforms existing semantic similarity measures in correlation with sequence homology and gene expression data, and that it is better at separating true from false interactions in a protein-protein interaction (PPI) prediction task. Relative to existing similarity measures, when validated on a yeast (Saccharomyces cerevisiae) reference database, simDEF improves correlation with sequence homology by up to 50%, improves correlation with gene expression by more than 4% in the biological process hierarchy of GO, and increases PPI predictability by more than 2.5% in F1-score for the molecular function hierarchy.
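The idea described above (PMI-weighted word vectors, summed into definition vectors and compared by cosine) can be sketched as follows. This is a simplified illustration under stated assumptions, not the released simDEF implementation: the definitions are made-up strings keyed by placeholder IDs rather than real GO terms, co-occurrence is counted only within these toy definitions (the real method may draw its statistics from a much larger corpus), only positive PMI values are kept, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Made-up toy definitions keyed by placeholder IDs (not real GO terms).
definitions = {
    "TERM_A": "binding of a protein to dna",
    "TERM_B": "binding of a protein to rna",
    "TERM_C": "transport of ions across a membrane",
    "TERM_D": "replication of dna",
}


def pmi_word_vectors(definitions):
    """Build a sparse PMI-weighted co-occurrence vector for every word.

    Two words co-occur when they appear in the same definition.
    Assumption: only positive PMI values are kept.
    """
    n_docs = len(definitions)
    word_counts, pair_counts = Counter(), Counter()
    for text in definitions.values():
        words = sorted(set(text.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    vectors = {w: {} for w in word_counts}
    for (a, b), n_ab in pair_counts.items():
        # PMI = log2( p(a, b) / (p(a) * p(b)) ), with counts over definitions
        pmi = math.log2(n_ab * n_docs / (word_counts[a] * word_counts[b]))
        if pmi > 0:
            vectors[a][b] = pmi
            vectors[b][a] = pmi
    return vectors


def definition_vector(text, word_vectors):
    """Sum the word vectors of all distinct words in a definition."""
    vec = Counter()
    for w in set(text.lower().split()):
        for context, weight in word_vectors.get(w, {}).items():
            vec[context] += weight
    return vec


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


word_vectors = pmi_word_vectors(definitions)
term_vectors = {t: definition_vector(d, word_vectors) for t, d in definitions.items()}
for a, b in [("TERM_A", "TERM_B"), ("TERM_A", "TERM_C")]:
    print(a, b, round(cosine(term_vectors[a], term_vectors[b]), 3))
```

In this toy corpus, words that occur in every definition (such as "of") receive a PMI of zero with every other word, so they drop out of the vectors without an explicit stop-word list.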

Availability

The code is free and can be used, modified, and redistributed without any restrictions.
Release date: September 2015
Documentation: Please refer to the provided instruction file before use (highly recommended).

Datasets for the evaluation

The datasets built for the study and used in the evaluation analyses are listed below (see the 'Validation datasets' subsection of the 'EXPERIMENTAL DATA' section for details):

  1. Sequence Homology Data (20,167 protein pairs)
  2. Gene Expression Data (4,800 protein pairs)
  3. PPI Data (6,000 protein pairs)

Citation

simDEF: Definition-based Semantic Similarity Measure of Gene Ontology Terms for Functional Similarity Analysis of Genes
Ahmad Pesaranghader; Stan Matwin; Marina Sokolova; Robert G. Beiko
Bioinformatics, 2015
doi: 10.1093/bioinformatics/btv755 (supplementary material file)



Ahmad Pesaranghader © 2015
