Hierarchical GO enrichment analysis with top-N roots clustering
Performs Gene Ontology (GO) enrichment analysis and organizes results into interpretable hierarchical clusters. Reduces hundreds of enriched terms to a small number of meaningful biological themes.
# Clone the repository
git clone https://github.com/Cellular-Semantics/goa_semantic_tools.git
cd goa_semantic_tools
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create environment and install dependencies
uv sync
First run: GO ontology (~30MB) and gene annotations (~140MB) will be downloaded automatically and cached in reference_data/.
# Comma-separated gene list
uv run go-enrichment --genes TP53,BRCA1,BRCA2,PTEN,RB1,APC --output results/
# From a file (one gene per line, # for comments)
uv run go-enrichment --genes-file genes.txt --output results/
# With custom parameters
uv run go-enrichment \
  --genes TP53,BRCA1,BRCA2 \
  --species mouse \
  --top-n 10 \
  --fdr 0.01 \
  --output results/ \
  --name my_analysis
Available options:
- --genes: Comma-separated gene symbols
- --genes-file: File with gene symbols (one per line)
- --output, -o: Output directory (required)
- --species: human or mouse (default: human)
- --top-n: Number of cluster roots per namespace (default: 5)
- --fdr: FDR threshold (default: 0.05)
- --name: Custom name for output file (default: auto-generated)
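The --genes-file format described above (one symbol per line, # for comments) can be read with a few lines of Python. This is a minimal sketch, not the tool's own parser: the function name parse_gene_lines is hypothetical, and both the handling of a # that follows a symbol on the same line and the removal of duplicate symbols are assumptions.

```python
def parse_gene_lines(text):
    """Return gene symbols from text with one symbol per line.

    Assumptions (not confirmed by the tool): everything after a '#' is a
    comment, blank lines are skipped, and repeated symbols are kept once.
    """
    genes = []
    seen = set()
    for raw_line in text.splitlines():
        # Drop the comment part, then surrounding whitespace.
        symbol = raw_line.split("#", 1)[0].strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            genes.append(symbol)
    return genes


example = "# tumour suppressors\nTP53\n\nBRCA1  # inline note\nTP53\n"
print(parse_gene_lines(example))
```

The resulting list could then be passed as gene_symbols to run_go_enrichment.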
from goa_semantic_tools.services import run_go_enrichment
# Run enrichment analysis
result = run_go_enrichment(
    gene_symbols=["TP53", "BRCA1", "BRCA2", "PTEN", "RB1", "APC"],
    species="human",      # or "mouse"
    top_n_roots=5,        # Number of cluster roots per namespace
    fdr_threshold=0.05,   # FDR significance threshold
)
# Access results
print(f"Found {len(result['clusters'])} clusters")
print(f"Total enriched terms: {result['metadata']['total_enriched_terms']}")
# Explore first cluster
cluster = result['clusters'][0]
print(f"\nCluster: {cluster['root_term']['name']}")
print(f" FDR: {cluster['root_term']['fdr']:.2e}")
print(f" Fold enrichment: {cluster['root_term']['fold_enrichment']:.2f}x")
print(f" Study genes: {cluster['root_term']['study_genes']}")
print(f" Member terms: {len(cluster['member_terms'])}")
{
"clusters": [
{
"root_term": {
"go_id": "GO:0008285",
"name": "negative regulation of cell population proliferation",
"namespace": "biological_process",
"fdr": 3.98e-09,
"fold_enrichment": 25.10,
"study_count": 10,
"study_genes": ["APC", "ATM", "BRCA2", "CDKN2A", ...]
},
"member_terms": [
# Descendant enriched terms...
],
"contributing_genes": [
{
"gene_symbol": "TP53",
"direct_annotations": [
{
"go_id": "GO:0008285",
"go_name": "negative regulation of...",
"evidence_code": "IDA"
}
]
}
]
}
],
"metadata": {
"input_genes_count": 14,
"genes_with_annotations": 14,
"total_enriched_terms": 282,
"clusters_count": 15,
"fdr_threshold": 0.05,
"species": "human"
}
}
- Downloads reference data (first run only):
  - GO ontology from GO Consortium PURL
  - Gene annotations (GAF) from EBI QuickGO
  - Cached in reference_data/ for subsequent runs
- Runs GO enrichment using GOATOOLS:
  - Fisher's exact test with FDR correction
  - Propagates counts up GO hierarchy
  - Finds significantly enriched terms
- Clusters results hierarchically:
  - Selects top-N most significant terms as "cluster roots"
  - Groups descendant terms under appropriate roots
  - Reduces 100+ terms to 5-10 interpretable themes
- Provides gene drill-down:
  - Shows which genes contribute to each cluster
  - Lists direct GO annotations with evidence codes
  - Explains where enrichment comes from
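The clustering step can be illustrated with a toy example. The sketch below is one plausible reading of "top-N roots" clustering, not the code in go_hierarchy.py: it assumes roots are the N terms with the lowest FDR, that each remaining enriched term joins the most significant root among its ancestors, and that terms with no root ancestor are left unassigned. The function names and single-letter term IDs are hypothetical, and the per-namespace split is omitted.

```python
def ancestors(term, parents):
    """Collect every ancestor of a term by walking parent links upward."""
    found = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


def cluster_by_top_roots(fdr_by_term, parents, top_n):
    """Group enriched terms under the top_n most significant terms."""
    # Rank by FDR; the term ID breaks ties so the order is deterministic.
    ranked = sorted(fdr_by_term, key=lambda t: (fdr_by_term[t], t))
    roots = ranked[:top_n]
    clusters = {root: [] for root in roots}
    unassigned = []
    for term in ranked[top_n:]:
        term_ancestors = ancestors(term, parents)
        # roots is sorted by significance, so the first match is the best one.
        home = next((r for r in roots if r in term_ancestors), None)
        if home is None:
            unassigned.append(term)
        else:
            clusters[home].append(term)
    return clusters, unassigned


# Toy hierarchy: child -> set of direct parents (hypothetical IDs).
parents = {"B": {"A"}, "C": {"B"}, "D": {"A"}, "Y": {"X"}}
fdr = {"A": 1e-9, "X": 1e-6, "C": 1e-4, "D": 1e-3, "Y": 0.01, "Z": 0.02}
clusters, unassigned = cluster_by_top_roots(fdr, parents, top_n=2)
print(clusters)    # {'A': ['C', 'D'], 'X': ['Y']}
print(unassigned)  # ['Z']
```

Six enriched terms collapse into two themes here; a real implementation also has to decide what happens when one root is a descendant of another.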
The result follows go_enrichment_output.schema.json:
- clusters: Array of hierarchical clusters
  - root_term: Most significant term (cluster root)
    - GO ID, name, namespace
    - P-value, FDR, fold enrichment
    - Study genes annotated to this term
  - member_terms: Descendant enriched terms in same hierarchy
  - contributing_genes: Genes with their direct annotations
- metadata: Analysis summary
  - Input/found gene counts
  - Total enriched terms before clustering
  - Number of clusters created
  - Parameters used
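To make the fdr and fold_enrichment fields easier to interpret, here is a rough sketch of how such values are commonly computed. It is illustrative only: the README says "FDR correction" without naming a method, so Benjamini-Hochberg is an assumption, and the fold-enrichment formula (study ratio divided by population ratio) is the conventional definition rather than one confirmed for this tool. In practice GOATOOLS performs these calculations.

```python
def fold_enrichment(study_count, study_n, pop_count, pop_n):
    """Ratio of the term's frequency in the study set to that in the population."""
    return (study_count / study_n) / (pop_count / pop_n)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# A term hit by 5 of 10 study genes but only 100 of 1000 population genes.
print(fold_enrichment(5, 10, 100, 1000))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A term is then reported as enriched when its adjusted value falls below the --fdr threshold.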
# Integration tests (requires downloaded data)
uv run pytest -m integration
# All tests
uv run pytest
src/goa_semantic_tools/goa_semantic_tools/
├── services/
│   └── go_enrichment_service.py    # Main enrichment pipeline
├── utils/
│   ├── data_downloader.py          # Download & cache GO/GAF data
│   ├── go_data_loader.py           # Load data with GOATOOLS
│   └── go_hierarchy.py             # Top-N clustering algorithm
└── schemas/
    ├── go_enrichment_input.schema.json
    └── go_enrichment_output.schema.json
- Python 3.14+
- GOATOOLS: GO enrichment analysis
- requests: Data downloads
- jsonschema: Schema validation
All managed via uv - see pyproject.toml.
- GO Ontology: http://purl.obolibrary.org/obo/go/go-basic.obo
- Human annotations: http://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/
- Mouse annotations: http://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/
Data is downloaded on first run and cached in reference_data/ (gitignored).
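The download-and-cache behaviour can be sketched as follows. This is not the code in data_downloader.py: the function names are hypothetical, and the sketch uses the standard library's urllib instead of requests so that it runs without extra dependencies. The fetch parameter exists only so the caching logic can be exercised without network access.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url):
    """Download a URL and return its body as bytes."""
    with urlopen(url) as response:
        return response.read()


def cached_download(url, cache_dir, filename, fetch=fetch_bytes):
    """Return the cached file path, downloading only if it is missing."""
    target = Path(cache_dir) / filename
    if target.exists():
        return target  # Cache hit: no network access.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    partial = target.with_name(target.name + ".part")
    partial.write_bytes(fetch(url))
    partial.replace(target)
    return target
```

A call such as cached_download("http://purl.obolibrary.org/obo/go/go-basic.obo", "reference_data", "go-basic.obo") would download the ontology once and reuse the local copy on later runs.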
- GOATOOLS_FINDINGS.md: Detailed API behavior and constraints from Week 0 validation
- CLAUDE.md: Development philosophy and Ring-based approach
- SCAFFOLD_GUIDE.md: Project scaffold and architecture decisions
Current Status: Ring 0 Phase 1 Complete
- Hierarchical GO enrichment with top-N clustering
- CLI runner (go-enrichment command)
- Python API
- Download-and-cache data strategy
Phase 2 (next):
- LLM-based natural language explanations of enrichment results
- Batch processing support
Ring 1 (after user validation):
- GO term definitions in explanations
- DeepSearch integration for experimental context
- RAG with literature references
- Graph visualizations
MIT License - see LICENSE for details.
Questions? See GOATOOLS_FINDINGS.md for technical details or CLAUDE.md for development guidelines.