Artemis: Indication-Aware Target Prioritization by Integrating Public Knowledge Graphs with Clinical Data
A Nextflow pipeline to reproduce the results from the Artemis paper.
Figure: Complete workflow diagram showing data ingestion, knowledge graph analysis, model training, target prediction, and visualization steps.
This pipeline performs:
- Data ingestion: ChEMBL drugs, MeSH disease terms, clinical scores from trials
- Knowledge graph analysis: Feature extraction from Hetionet, BioKG, OpenBioLink, and PrimeKG
- Cross-validation: Binary, multiclass, and regression models for 7 disease indications
- Target prediction: Random forest classifiers trained on KG embeddings + clinical scores
- Consensus analysis: Averaging predictions across knowledge graphs and iterations with hierarchical clustering
- Visualization: Upset plots, heatmaps, ROC curves, baseline statistics, and consensus clustermaps
- Nextflow ≥ 23.04
- Docker or Singularity
# Pipeline runs with pre-built Docker image (public.ecr.aws/alethiotx/artemis-paper:latest)
nextflow run main.nf -profile local# Default mode: predictions
nextflow run main.nf -profile local
# Specify a different mode
nextflow run main.nf -profile local --mode cv
# Use specific scores date
nextflow run main.nf -profile local --scores_date 2025-12-12Controlled via --mode parameter:
Download and preprocess external datasets:
- ChEMBL drug database
- MeSH disease ontology
Usage:
nextflow run main.nf --mode dataCompute clinical target scores and pathway gene sets for 7 indications:
- Breast, Lung, Bowel, Prostate, Melanoma, Diabetes, Cardiovascular
Usage:
nextflow run main.nf --mode scores --scores_date 2025-12-12Evaluate ML models (Random Forest) across:
- 4 knowledge graphs
- 7 disease indications
- 3 task types: binary, multiclass, regression
Outputs:
- ROC-AUC scores per KG/indication/task
- Feature importance rankings
- Learning curves (subsampling analysis)
Usage:
nextflow run main.nf --mode cvTrain classifiers and predict targets:
- Grid search: 4 KGs × 3 filtering modes × 5 RF thresholds × 3 pathway gene counts × 10 iterations
- Outputs: predicted targets, training sets, cross-indication overlap matrices, SABCS validation, baseline statistics
- Aggregates all predictions and training labels into unified pickle files for downstream analysis
- Baseline Analysis:
- Computes baseline statistics (percentage of targets above threshold) averaged across iterations and KGs
- Generates per-indication heatmaps showing baseline percentages by clinical trial filter, RF threshold, and pathway genes
- Creates comprehensive boxplot comparing baseline distributions across all 7 indications
- Produces publication-ready comparison plot faceting baseline, sensitivity, and specificity (plotnine/ggplot2 style)
- SABCS Consensus:
- Generates standard heatmap grids for each RF threshold (pathway genes × clinical trial filters × KGs)
- Creates horizontal heatmap for RF 0.7 with Unique CT filter across all 4 KGs for focused comparison
- Performs hierarchical clustering to identify consensus predictions across knowledge graphs
Usage:
nextflow run main.nf --mode predictions --scores_date 2025-12-12Recompute baseline statistics and consensus analysis using existing prediction results:
- Loads pre-computed target predictions and training sets from S3
- Regenerates baseline statistics and visualizations across all indications
- Recomputes SABCS consensus predictions without retraining models
- Updates visualization styles (e.g., plotnine-based plots with enhanced aesthetics)
- Useful for updating visualizations or analysis parameters without re-running expensive compute jobs
Usage:
nextflow run main.nf --mode post_predictionsGenerate knowledge graph overview notebook (statistics, entity counts, relationship types).
Usage:
nextflow run main.nf --mode kgsCreate UpSet plots showing target overlap across knowledge graphs and filtering strategies.
Usage:
nextflow run main.nf --mode upset| Parameter | Default | Description |
|---|---|---|
mode |
predictions |
Pipeline mode (see above) |
scores_date |
2025-12-12 |
Clinical scores snapshot date (YYYY-MM-DD) |
outdir |
s3://alethiotx-artemis |
Output directory (S3 or local path) |
chembl_version |
36 |
ChEMBL database version |
mesh_file_base |
d2025 |
MeSH vocabulary release |
local: Docker execution on local machine- Base config: 8 CPUs, 16 GB RAM, 8h timeout
- CV tasks: 16 CPUs, 64 GB RAM (auto-retry with more memory)
seqera: Cloud execution via Seqera Platform (Tower)
Override example:
nextflow run main.nf -profile local --outdir ./results --scores_date 2024-09-04s3://alethiotx-artemis/
├── figs/
│ ├── predictions/
│ │ ├── plots/
│ │ │ ├── indications/ # Per-indication sensitivity plots
│ │ │ ├── kgs/ # KG comparison boxplots
│ │ │ └── heatmaps/ # Cross-indication overlap heatmaps
│ │ └── data/ # Combined prediction data
│ ├── predicted_targets/
│ │ └── all_targets.pickle # Unified target probabilities (all indications)
│ ├── training_sets/
│ │ └── all_training_sets.pickle # Unified training labels (all indications)
│ ├── baselines/
│ │ ├── plots/
│ │ │ ├── heatmaps/
│ │ │ │ ├── breast.png # Per-indication baseline heatmaps
│ │ │ │ └── ...
│ │ │ ├── all_baseline.png # Boxplots across all indications
│ │ │ └── for_paper.png # Publication-ready comparison plot (plotnine)
│ │ └── data/
│ │ ├── indications/
│ │ │ ├── breast.pickle # Per-indication baseline statistics
│ │ │ └── ...
│ │ └── for_paper.csv # Combined baseline & predictions data
│ ├── sabcs/
│ │ ├── plots/
│ │ │ ├── 0.5.png # Standard heatmap grid
│ │ │ ├── ...
│ │ │ └── 0.7_unique_horizontal.png # Horizontal Unique-only plot
│ │ └── data/ # SABCS overlap data
│ └── sabcs_consensus/
│ ├── plots/ # Consensus prediction heatmaps & clustermaps
│ └── data/all.pickle # Consensus predictions
s3://alethiotx-artemis/figs/cv/
├── data/
│ └── combined_cv_scores.csv
└── plots/
├── roc_curves_*.png
└── learning_curves.png
- Hetionet: Integrated biomedical KG (nodes: 47K, edges: 2.2M)
- BioKG: Drug-disease-gene KG from multiple sources
- OpenBioLink: Open biomedical link prediction benchmark
- PrimeKG: Precision medicine KG (diseases, drugs, genes, pathways)
main.nf
├── modules/
│ ├── data/ # ChEMBL + MeSH ingestion
│ ├── clinical_scores/ # Trial data → target scores
│ ├── pathway_genes/ # Reactome/KEGG enrichment
│ ├── cv/ # Cross-validation experiments
│ ├── predictions/ # Target prediction + ranking
│ │ ├── compute.py # Generate predictions per parameter combo
│ │ ├── combine.py # Aggregate prediction overlaps
│ │ ├── targets.py # Aggregate target probabilities
│ │ ├── training_sets.py # Aggregate training labels
│ │ ├── baselines.py # Compute baseline statistics
│ │ ├── sabcs.py # SABCS overlap analysis
│ │ └── consensus_sabcs.py # Consensus predictions for SABCS
│ ├── upset/ # Set overlap visualization
│ └── kgs/ # KG summary notebook
└── conf/
├── base.config # Resource defaults
└── local.config # Local execution settings
pandas,numpy,scipy: Data manipulationscikit-learn: ML models (Random Forest, SVM)pykeen==1.11.1: Knowledge graph embeddingsalethiotx>=2.0.9: Proprietary data access utilitiesplotnine: ggplot2-style visualization for publication-ready plotsseaborn,matplotlib: Statistical visualizationupsetplot: Set intersection plotsfsspec,s3fs: Cloud storage I/O
See requirements.txt for full list.
Pre-built image: public.ecr.aws/alethiotx/artemis-paper:latest
Built from Dockerfile with Python 3.13 on Ubuntu 25.10.
docker build -t artemis-paper:dev .nextflow run main.nf -profile local -entry cvnextflow run main.nf -profile local -resumenextflow clean -f
rm -rf work/The pipeline uses the alethiotx Python package internally. For interactive analysis, you can use the same utilities:
from alethiotx.artemis.clinical.scores import load as load_clinical_scores
from alethiotx.artemis.clinical.scores import approved, unique
# Load clinical target scores for all 7 indications
breast, lung, prostate, melanoma, bowel, diabetes, cardiovascular = \
load_clinical_scores(date='2025-12-10')
# Filter to approved drugs only (score > 20)
approved_scores = approved(
[breast, lung, prostate, melanoma, bowel, diabetes, cardiovascular]
)
# Or filter to unique targets
unique_scores = unique(
[breast, lung, prostate, melanoma, bowel, diabetes, cardiovascular]
)from alethiotx.artemis.pathway.genes import load as load_pathway_genes
# Get pathway genes (top 100 per indication)
pathway_genes = load_pathway_genes(date='2025-12-10', n=100)from alethiotx.artemis.cv.pipeline import prepare as prepare_model
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# Load KG features
kg_features = pd.read_csv(
's3://alethiotx-artemis/data/kgs/associations/biokg/summarize/predictions.csv',
index_col=0
)
# Prepare training data
res = prepare_model(
kg_features,
breast,
pathway_genes=pathway_genes[0],
rand_seed=42
)
# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(res['X'], res['y_binary'])
# Predict probabilities
probs = rf.predict_proba(kg_features)See the alethiotx package documentation for full API reference.
Preprint: TBD
Code: https://github.com/alethiomics/artemis-paper
License: MIT (see LICENSE)
- Issues: https://github.com/alethiomics/artemis-paper/issues
- Contact: [email protected]
- Public knowledge graph providers (Hetionet, OpenBioLink, PrimeKG)
- ChEMBL and MeSH data sources
- PyKEEN, scikit-learn, and Nextflow communities
- Portions of this codebase were generated, refactored, and/or cleaned using GitHub Copilot (Claude Sonnet 4.5). The authors reviewed, modified, and validated all AI-assisted code. Responsibility for the correctness, performance, and reproducibility of the code rests entirely with the authors.