Artemis Pipeline

Artemis: Indication-Aware Target Prioritization by Integrating Public Knowledge Graphs with Clinical Data

A Nextflow pipeline to reproduce the results from the Artemis paper.

Pipeline Overview

Figure: Complete workflow diagram showing data ingestion, knowledge graph analysis, model training, target prediction, and visualization steps.

Overview

This pipeline performs:

Data ingestion: ChEMBL drugs, MeSH disease terms, clinical scores from trials
Knowledge graph analysis: Feature extraction from Hetionet, BioKG, OpenBioLink, and PrimeKG
Cross-validation: Binary, multiclass, and regression models for 7 disease indications
Target prediction: Random forest classifiers trained on KG embeddings + clinical scores
Consensus analysis: Averaging predictions across knowledge graphs and iterations with hierarchical clustering
Visualization: Upset plots, heatmaps, ROC curves, baseline statistics, and consensus clustermaps
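
The consensus step listed above averages predictions across knowledge graphs and iterations. A minimal sketch of that averaging in plain Python is shown here. The input layout (one dictionary per run, mapping gene symbol to predicted probability), the function name `consensus_scores`, and the choice to average only over runs that scored a gene are assumptions for illustration, not the pipeline's actual code. The hierarchical clustering step is not shown.

```python
from collections import defaultdict
from statistics import mean


def consensus_scores(runs):
    """Average each gene's predicted probability over all runs that scored it.

    `runs` is a hypothetical layout: one dict per (knowledge graph, iteration)
    pair, mapping gene symbol -> predicted probability.
    """
    collected = defaultdict(list)
    for run in runs:
        for gene, prob in run.items():
            collected[gene].append(prob)
    averaged = {gene: mean(probs) for gene, probs in collected.items()}
    # Sort by descending mean probability, then by gene symbol for stable ties.
    return dict(sorted(averaged.items(), key=lambda kv: (-kv[1], kv[0])))


# Toy example: two runs with made-up probabilities.
runs = [
    {"EGFR": 0.9, "TP53": 0.4},
    {"EGFR": 0.7, "TP53": 0.6, "KRAS": 0.5},
]
ranked = consensus_scores(runs)
```

The resulting dictionary is ordered from the highest to the lowest mean probability, which is a convenient form for ranking candidate targets.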

Quick Start

Prerequisites

Nextflow ≥ 23.04
Docker or Singularity

Installation

# No separate install step: the pipeline runs with the pre-built Docker image (public.ecr.aws/alethiotx/artemis-paper:latest)
nextflow run main.nf -profile local

Run the pipeline

# Default mode: predictions
nextflow run main.nf -profile local

# Specify a different mode
nextflow run main.nf -profile local --mode cv

# Use specific scores date
nextflow run main.nf -profile local --scores_date 2025-12-12

Pipeline Modes

Select the mode with the --mode parameter:

1. `data`

Download and preprocess external datasets:

ChEMBL drug database
MeSH disease ontology

Usage:

nextflow run main.nf --mode data

2. `scores`

Compute clinical target scores and pathway gene sets for 7 indications:

Breast, Lung, Bowel, Prostate, Melanoma, Diabetes, Cardiovascular

Usage:

nextflow run main.nf --mode scores --scores_date 2025-12-12

3. `cv` (Cross-Validation)

Evaluate ML models (Random Forest) across:

4 knowledge graphs
7 disease indications
3 task types: binary, multiclass, regression

Outputs:

ROC-AUC scores per KG/indication/task
Feature importance rankings
Learning curves (subsampling analysis)
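
To make the ROC-AUC output concrete, the sketch below computes the metric from its pairwise definition: the fraction of positive/negative pairs in which the positive example receives the higher score. It is a plain-Python illustration of what the number means, not the pipeline's evaluation code, and the labels and scores are made up.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and scores (made-up values, not pipeline output).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of positives from negatives.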

Usage:

nextflow run main.nf --mode cv

4. `predictions` (default)

Train classifiers and predict targets:

Grid search: 4 KGs × 3 filtering modes × 5 RF thresholds × 3 pathway gene counts × 10 iterations
Outputs: predicted targets, training sets, cross-indication overlap matrices, SABCS validation, baseline statistics
Aggregates all predictions and training labels into unified pickle files for downstream analysis
Baseline Analysis:
- Computes baseline statistics (percentage of targets above threshold) averaged across iterations and KGs
- Generates per-indication heatmaps showing baseline percentages by clinical trial filter, RF threshold, and pathway genes
- Creates comprehensive boxplot comparing baseline distributions across all 7 indications
- Produces publication-ready comparison plot faceting baseline, sensitivity, and specificity (plotnine/ggplot2 style)
SABCS Consensus:
- Generates standard heatmap grids for each RF threshold (pathway genes × clinical trial filters × KGs)
- Creates horizontal heatmap for RF 0.7 with Unique CT filter across all 4 KGs for focused comparison
- Performs hierarchical clustering to identify consensus predictions across knowledge graphs
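
The size of the parameter grid listed above can be checked by enumerating it. In this sketch only the knowledge graph names and the number of values per dimension come from this README; the filter labels, threshold values, and pathway gene counts are placeholders chosen for illustration.

```python
from itertools import product

# Knowledge graphs named in this README.
KGS = ["hetionet", "biokg", "openbiolink", "primekg"]
# Placeholder values: only the counts (3, 5, 3, 10) are taken from the README.
FILTER_MODES = ["filter_a", "filter_b", "filter_c"]
RF_THRESHOLDS = [0.5, 0.6, 0.7, 0.8, 0.9]
PATHWAY_GENE_COUNTS = [50, 100, 200]
ITERATIONS = range(10)

grid = list(product(KGS, FILTER_MODES, RF_THRESHOLDS, PATHWAY_GENE_COUNTS, ITERATIONS))
print(len(grid))  # 4 * 3 * 5 * 3 * 10 = 1800
```

Each tuple in `grid` is one (KG, filter, threshold, pathway gene count, iteration) combination.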

Usage:

nextflow run main.nf --mode predictions --scores_date 2025-12-12

5. `post_predictions`

Recompute baseline statistics and consensus analysis using existing prediction results:

Loads pre-computed target predictions and training sets from S3
Regenerates baseline statistics and visualizations across all indications
Recomputes SABCS consensus predictions without retraining models
Updates visualization styles (e.g., plotnine-based plots with enhanced aesthetics)
Useful for updating visualizations or analysis parameters without re-running expensive compute jobs
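
The baseline statistic that this mode regenerates is described under `predictions` as the percentage of targets above the RF threshold, averaged across iterations and KGs. A rough sketch of that calculation follows. The function names, the list-of-lists input, and the use of a strict "greater than" comparison are assumptions, and the probabilities are made up.

```python
from statistics import mean


def percent_above(probabilities, threshold):
    """Percentage of probabilities strictly greater than `threshold`."""
    if not probabilities:
        raise ValueError("no probabilities given")
    hits = sum(1 for p in probabilities if p > threshold)
    return 100.0 * hits / len(probabilities)


def baseline(runs, threshold):
    """Mean of the per-run percentages, one run per (KG, iteration) pair."""
    return mean(percent_above(run, threshold) for run in runs)


# Two hypothetical runs with four scored targets each.
runs = [
    [0.9, 0.8, 0.2, 0.1],   # 2 of 4 above 0.7 -> 50%
    [0.75, 0.3, 0.2, 0.1],  # 1 of 4 above 0.7 -> 25%
]
value = baseline(runs, threshold=0.7)  # mean of 50% and 25%
```

Because the calculation only needs stored probabilities, it can be repeated with a different threshold without retraining any model.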

Usage:

nextflow run main.nf --mode post_predictions

6. `kgs`

Generate knowledge graph overview notebook (statistics, entity counts, relationship types).

Usage:

nextflow run main.nf --mode kgs
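
As an illustration of the kind of summary the notebook reports, the sketch below counts entities per type and edges per relation type from a list of triples. The triple layout, the "Type::identifier" node labels, and the function name `summarize` are assumptions for this example; the actual notebook may load and label the graphs differently.

```python
from collections import Counter


def summarize(triples):
    """Count distinct entities per type and edges per relation type."""
    entities = set()
    relation_counts = Counter()
    for head, relation, tail in triples:
        entities.update((head, tail))
        relation_counts[relation] += 1
    # Node labels are assumed to look like "Type::identifier".
    entity_counts = Counter(label.split("::", 1)[0] for label in entities)
    return entity_counts, relation_counts


# Toy triples for illustration only.
triples = [
    ("Compound::gefitinib", "binds", "Gene::EGFR"),
    ("Gene::EGFR", "associates", "Disease::lung cancer"),
    ("Gene::KRAS", "associates", "Disease::lung cancer"),
]
entity_counts, relation_counts = summarize(triples)
```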

7. `upset`

Create UpSet plots showing target overlap across knowledge graphs and filtering strategies.

Usage:

nextflow run main.nf --mode upset
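
An UpSet plot draws one bar per combination of sets, counting the items that belong to exactly that combination. The sketch below computes those exclusive intersections in plain Python; the gene lists per knowledge graph are invented, and the function name `exclusive_intersections` is hypothetical.

```python
def exclusive_intersections(named_sets):
    """Map each combination of set names to the items found in exactly those sets."""
    membership = {}
    for name, items in named_sets.items():
        for item in items:
            membership.setdefault(item, set()).add(name)
    groups = {}
    for item, names in membership.items():
        groups.setdefault(frozenset(names), set()).add(item)
    return groups


# Made-up target lists for three of the knowledge graphs.
targets_by_kg = {
    "hetionet": {"EGFR", "TP53", "KRAS"},
    "biokg": {"EGFR", "KRAS", "BRAF"},
    "primekg": {"EGFR", "ESR1"},
}
groups = exclusive_intersections(targets_by_kg)
# Print the largest combinations first.
for names, genes in sorted(groups.items(), key=lambda kv: (-len(kv[0]), sorted(kv[0]))):
    print(sorted(names), sorted(genes))
```

The `upsetplot` package listed under Dependencies offers a `from_contents` helper that accepts a similar name-to-items mapping and prepares it for plotting.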

Configuration

Parameters (`nextflow.config`)

Parameter	Default	Description
`mode`	`predictions`	Pipeline mode (see above)
`scores_date`	`2025-12-12`	Clinical scores snapshot date (YYYY-MM-DD)
`outdir`	`s3://alethiotx-artemis`	Output directory (S3 or local path)
`chembl_version`	`36`	ChEMBL database version
`mesh_file_base`	`d2025`	MeSH vocabulary release

Profiles

local: Docker execution on local machine
- Base config: 8 CPUs, 16 GB RAM, 8h timeout
- CV tasks: 16 CPUs, 64 GB RAM (auto-retry with more memory)
seqera: Cloud execution via Seqera Platform (Tower)

Override example:

nextflow run main.nf -profile local --outdir ./results --scores_date 2024-09-04
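
When runs are launched from a script, it can help to validate the overrides before calling Nextflow. The helper below is a hypothetical wrapper, not part of the pipeline: it checks the mode against the seven modes documented above, checks that the scores date parses as a year-month-day date, and returns the argument list without executing it.

```python
from datetime import datetime

VALID_MODES = {"data", "scores", "cv", "predictions", "post_predictions", "kgs", "upset"}


def build_command(mode="predictions", scores_date="2025-12-12", outdir=None, profile="local"):
    """Return the argv list for a pipeline run, validating mode and date first."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # strptime raises ValueError for strings that do not parse as a year-month-day date.
    datetime.strptime(scores_date, "%Y-%m-%d")
    cmd = ["nextflow", "run", "main.nf", "-profile", profile,
           "--mode", mode, "--scores_date", scores_date]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    return cmd


cmd = build_command(mode="cv", outdir="./results")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting problems with paths.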

Outputs

Predictions mode

s3://alethiotx-artemis/
├── figs/
│   ├── predictions/
│   │   ├── plots/
│   │   │   ├── indications/   # Per-indication sensitivity plots
│   │   │   ├── kgs/           # KG comparison boxplots
│   │   │   └── heatmaps/      # Cross-indication overlap heatmaps
│   │   └── data/              # Combined prediction data
│   ├── predicted_targets/
│   │   └── all_targets.pickle # Unified target probabilities (all indications)
│   ├── training_sets/
│   │   └── all_training_sets.pickle # Unified training labels (all indications)
│   ├── baselines/
│   │   ├── plots/
│   │   │   ├── heatmaps/
│   │   │   │   ├── breast.png     # Per-indication baseline heatmaps
│   │   │   │   └── ...
│   │   │   ├── all_baseline.png   # Boxplots across all indications
│   │   │   └── for_paper.png      # Publication-ready comparison plot (plotnine)
│   │   └── data/
│   │       ├── indications/
│   │       │   ├── breast.pickle  # Per-indication baseline statistics
│   │       │   └── ...
│   │       └── for_paper.csv      # Combined baseline & predictions data
│   ├── sabcs/
│   │   ├── plots/
│   │   │   ├── 0.5.png            # Standard heatmap grid
│   │   │   ├── ...
│   │   │   └── 0.7_unique_horizontal.png  # Horizontal Unique-only plot
│   │   └── data/              # SABCS overlap data
│   └── sabcs_consensus/
│       ├── plots/             # Consensus prediction heatmaps & clustermaps
│       └── data/all.pickle    # Consensus predictions
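
For scripted access to these outputs, the layout above can be turned into paths with a small helper. This is only a sketch: the function name `baseline_paths` is hypothetical, and the lowercase file names for indications other than `breast` are assumed from the pattern in the tree.

```python
import posixpath

# Lowercase names are an assumption; only "breast" appears in the tree above.
INDICATIONS = ["breast", "lung", "bowel", "prostate", "melanoma", "diabetes", "cardiovascular"]


def baseline_paths(indication, outdir="s3://alethiotx-artemis"):
    """Return the baseline heatmap and statistics locations for one indication."""
    if indication not in INDICATIONS:
        raise ValueError(f"unknown indication: {indication!r}")
    base = posixpath.join(outdir, "figs", "baselines")
    return {
        "heatmap": posixpath.join(base, "plots", "heatmaps", f"{indication}.png"),
        "stats": posixpath.join(base, "data", "indications", f"{indication}.pickle"),
    }


paths = baseline_paths("breast")
```

Passing a local `outdir` (for example `./results`) yields the same layout relative to that directory.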

CV mode

s3://alethiotx-artemis/figs/cv/
├── data/
│   └── combined_cv_scores.csv
└── plots/
    ├── roc_curves_*.png
    └── learning_curves.png

Architecture

Knowledge Graphs

Hetionet: Integrated biomedical KG (nodes: 47K, edges: 2.2M)
BioKG: Drug-disease-gene KG from multiple sources
OpenBioLink: Open biomedical link prediction benchmark
PrimeKG: Precision medicine KG (diseases, drugs, genes, pathways)

Workflow Structure

main.nf
├── modules/
│   ├── data/              # ChEMBL + MeSH ingestion
│   ├── clinical_scores/   # Trial data → target scores
│   ├── pathway_genes/     # Reactome/KEGG enrichment
│   ├── cv/                # Cross-validation experiments
│   ├── predictions/       # Target prediction + ranking
│   │   ├── compute.py         # Generate predictions per parameter combo
│   │   ├── combine.py         # Aggregate prediction overlaps
│   │   ├── targets.py         # Aggregate target probabilities
│   │   ├── training_sets.py   # Aggregate training labels
│   │   ├── baselines.py       # Compute baseline statistics
│   │   ├── sabcs.py           # SABCS overlap analysis
│   │   └── consensus_sabcs.py # Consensus predictions for SABCS
│   ├── upset/             # Set overlap visualization
│   └── kgs/               # KG summary notebook
└── conf/
    ├── base.config        # Resource defaults
    └── local.config       # Local execution settings

Dependencies

Core Python Packages

pandas, numpy, scipy: Data manipulation
scikit-learn: ML models (Random Forest, SVM)
pykeen==1.11.1: Knowledge graph embeddings
alethiotx>=2.0.9: Proprietary data access utilities
plotnine: ggplot2-style visualization for publication-ready plots
seaborn, matplotlib: Statistical visualization
upsetplot: Set intersection plots
fsspec, s3fs: Cloud storage I/O

See requirements.txt for full list.

Container

Pre-built image: public.ecr.aws/alethiotx/artemis-paper:latest
Built from Dockerfile with Python 3.13 on Ubuntu 25.10.

Development

Build Docker image locally

docker build -t artemis-paper:dev .

Run single process

# -entry selects the named workflow in main.nf to run (here: cv)
nextflow run main.nf -profile local -entry cv

Resume failed runs

nextflow run main.nf -profile local -resume

Clean work directory

nextflow clean -f
rm -rf work/

Python API

The pipeline uses the alethiotx Python package internally. For interactive analysis, you can use the same utilities:

Example: Load clinical scores

from alethiotx.artemis.clinical.scores import load as load_clinical_scores
from alethiotx.artemis.clinical.scores import approved, unique

# Load clinical target scores for all 7 indications
breast, lung, prostate, melanoma, bowel, diabetes, cardiovascular = \
    load_clinical_scores(date='2025-12-10')

# Filter to approved drugs only (score > 20)
approved_scores = approved(
    [breast, lung, prostate, melanoma, bowel, diabetes, cardiovascular]
)

# Or filter to unique targets
unique_scores = unique(
    [breast, lung, prostate, melanoma, bowel, diabetes, cardiovascular]
)

Example: Load pathway genes

from alethiotx.artemis.pathway.genes import load as load_pathway_genes

# Get pathway genes (top 100 per indication)
pathway_genes = load_pathway_genes(date='2025-12-10', n=100)

Example: Train classifier

from alethiotx.artemis.cv.pipeline import prepare as prepare_model
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load KG features
kg_features = pd.read_csv(
    's3://alethiotx-artemis/data/kgs/associations/biokg/summarize/predictions.csv',
    index_col=0
)

# Prepare training data
res = prepare_model(
    kg_features, 
    breast,
    pathway_genes=pathway_genes[0],
    rand_seed=42
)

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(res['X'], res['y_binary'])

# Predict class probabilities (one column per class, in the order of rf.classes_)
probs = rf.predict_proba(kg_features)

See the alethiotx package documentation for full API reference.

Citation

Preprint: TBD
Code: https://github.com/alethiomics/artemis-paper
License: MIT (see LICENSE)

Support

Issues: https://github.com/alethiomics/artemis-paper/issues
Contact: [email protected]

Acknowledgements

Public knowledge graph providers (Hetionet, BioKG, OpenBioLink, PrimeKG)
ChEMBL and MeSH data sources
PyKEEN, scikit-learn, and Nextflow communities
Portions of this codebase were generated, refactored, and/or cleaned using GitHub Copilot (Claude Sonnet 4.5). The authors reviewed, modified, and validated all AI-assisted code. Responsibility for the correctness, performance, and reproducibility of the code rests entirely with the authors.
