Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discrepancies between using GoSearch and direct search on geneontology.org #243

Open
jasperhyp opened this issue Jul 11, 2022 · 1 comment

Comments

@jasperhyp
Copy link

jasperhyp commented Jul 11, 2022

Hi! Thanks for creating this wonderful library. I followed the instructions at https://github.com/tanghaibao/goatools/blob/main/notebooks/cell_cycle.ipynb to get GO-associated genes, but soon found out that there are some discrpancies between the results (in terms of # of ontology terms I get and also # of genes) derived here and from the website. I used go.obo (downloaded from here) instead of go-basic.obo to avoid data discrepancies.

I first noticed the number of GO terms hit are different. There are too many "cell cycle" hits so as an example, let's change "cell cycle" to "lipid metabolic". There are 9 hits on the website but 14 via the library. I manually examined the discrepancies and it looks like the library is capturing a lot of "<something>lipid metabolic" which the website does not, but cannot explain why the library hits "regulation of lipid metabolic process" (GO:0019216) and "membrane lipid metabolic process" (GO:0006643) but not "regulation of membrane lipid metabolic process" (GO:1905038).

Then, I changed "lipid metabolic" to " lipid metabolic" (adding a blank in the front) to avoid hitting the "<sth>lipid" stuff. Unexpectedly, there are only 6 hit terms left for the library (still 9 for the website). Weirdly, the library hit ether lipid metabolic process (GO:0046485) but not neutral lipid metabolic process (GO:0006638). There are more examples and I just couldn't understand why.

Yet again, to maximize replicability, I directly fed GO:0007049 (cell cycle) into srchhelp.add_children_gos() and I also went into that term through the website and ensured additional filtering (screenshot attached below). However, there are several hundred (unique) genes derived from the website but only 41 are derived from the library.

The code for feeding cell cycle directly is basically,

with open(fout_allgos, "w") as log:
    # Search for 'cell cycle' in GO terms
    # gos_cc_all = srchhelp.get_matching_gos(cell_cycle_all, prt=log)
    # Find any GOs matching 'cell cycle-independent' (e.g., "lysosome")
    # gos_no_cc = srchhelp.get_matching_gos(cell_cycle_not, gos=gos_cc_all, prt=log)
    # Remove GO terms that are not "cell cycle" GOs
    # gos = gos_cc_all.difference(gos_no_cc)
    # Add children GOs of cell cycle GOs
    # gos_all = srchhelp.add_children_gos(gos)
    gos_all = srchhelp.add_children_gos({'GO:0007049'})
    # Get Entrez GeneIDs for cell cycle GOs
    geneids = srchhelp.get_items(gos_all)

I would appreciate any help! I am new to GO so I might be doing many things wrongly. Please feel free to correct anything.

image

@jasperhyp jasperhyp changed the title Discrepancy between using GoSearch and direct search on geneontology.org Discrepancies between using GoSearch and direct search on geneontology.org Jul 11, 2022
@jasperhyp
Copy link
Author

I truly don't understand... The biological processes in go.obo were just not hit by exactly the same queries in many cases...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant