
Healthcare ML Training Pipeline

Serverless GPU training infrastructure for healthcare NLP models. Training jobs run on RunPod serverless GPUs, and trained models are stored in S3.

Overview

This project provides production-ready ML pipelines for training healthcare classification models:

  • Drug-Drug Interaction (DDI) - Severity classification from DrugBank (176K samples)
  • Adverse Drug Events (ADE) - Binary detection from ADE Corpus V2 (30K samples)
  • Medical Triage - Urgency level classification
  • Symptom-to-Disease - Diagnosis prediction (41 disease classes)

All models use Bio_ClinicalBERT as the base and are fine-tuned on domain-specific datasets.

Training Results

| Task | Dataset | Samples | Accuracy | F1 Score |
|------|---------|---------|----------|----------|
| DDI Classification | DrugBank | 176K | 100% | 100% |
| ADE Detection | ADE Corpus V2 | 9K | 93.5% | 95.3% |
| Symptom-Disease | Disease Symptoms | 4.4K | 100% | 100% |

Quick Start

Run Training

```bash
curl -X POST "https://api.runpod.ai/v2/YOUR_ENDPOINT/run" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "task": "ddi",
      "model_name": "emilyalsentzer/Bio_ClinicalBERT",
      "max_samples": 10000,
      "epochs": 3,
      "batch_size": 16,
      "s3_bucket": "your-bucket",
      "aws_access_key_id": "...",
      "aws_secret_access_key": "...",
      "aws_session_token": "..."
    }
  }'
```

Available tasks: `ddi`, `ade`, `triage`, `symptom_disease`
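The same request can be scripted in Python using only the standard library. This is a minimal sketch, not code from this repo: the helper names (`build_payload`, `submit`) are illustrative, and the payload fields mirror the curl example above.

```python
import json
import os
import urllib.request

RUNPOD_API = "https://api.runpod.ai/v2/{endpoint}/run"
VALID_TASKS = {"ddi", "ade", "triage", "symptom_disease"}

def build_payload(task, s3_bucket, model_name="emilyalsentzer/Bio_ClinicalBERT",
                  max_samples=10000, epochs=3, batch_size=16):
    """Assemble the request body expected by the serverless handler."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task: {task}")
    return {
        "input": {
            "task": task,
            "model_name": model_name,
            "max_samples": max_samples,
            "epochs": epochs,
            "batch_size": batch_size,
            "s3_bucket": s3_bucket,
            # Credentials are read from the environment, never hard-coded.
            "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID", ""),
            "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
            "aws_session_token": os.environ.get("AWS_SESSION_TOKEN", ""),
        }
    }

def submit(endpoint_id, api_key, payload):
    """POST the job to the RunPod /run endpoint and return the JSON response."""
    req = urllib.request.Request(
        RUNPOD_API.format(endpoint=endpoint_id),
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```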

Download Trained Model

```bash
aws s3 cp s3://your-bucket/model.tar.gz .
tar -xzf model.tar.gz
```
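The unpack step can also be done from Python with the stdlib `tarfile` module. This sketch (the `extract_model` helper is illustrative, not part of this repo) adds a path-traversal check before extracting:

```python
import tarfile
from pathlib import Path

def extract_model(archive_path, dest="."):
    """Unpack model.tar.gz, refusing entries that escape the target dir."""
    dest = Path(dest).resolve()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            # Reject absolute paths and ".." components in archive entries.
            if not str(target).startswith(str(dest)):
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return dest
```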

Project Structure

```
├── components/
│   └── runpod_trainer/
│       ├── Dockerfile
│       ├── handler.py          # Multi-task training logic
│       ├── requirements.txt
│       └── data/               # DrugBank DDI dataset
├── pipelines/
│   ├── healthcare_training.py  # Kubeflow pipeline definitions
│   ├── ddi_training_runpod.py
│   └── ddi_data_prep.py
├── .github/workflows/
│   └── build-trainer.yaml      # CI/CD
└── manifests/
    └── argocd-app.yaml
```

Configuration

All configuration is via environment variables. Copy .env.example to .env and fill in your values:

```bash
cp .env.example .env
# Edit .env with your credentials
```

Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| RUNPOD_API_KEY | Yes | - | RunPod API key |
| RUNPOD_ENDPOINT | Yes | - | RunPod serverless endpoint ID |
| AWS_ACCESS_KEY_ID | Yes | - | AWS credentials for S3 |
| AWS_SECRET_ACCESS_KEY | Yes | - | AWS credentials for S3 |
| AWS_SESSION_TOKEN | No | - | For assumed role sessions |
| AWS_REGION | No | us-east-1 | AWS region |
| S3_BUCKET | Yes | - | Bucket for model artifacts |
| BASE_MODEL | No | Bio_ClinicalBERT | HuggingFace model ID |
| MAX_SAMPLES | No | 10000 | Training samples |
| EPOCHS | No | 3 | Training epochs |
| BATCH_SIZE | No | 16 | Batch size |
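A small loader can enforce this table at startup: required variables fail fast, optional ones fall back to the defaults above. The `load_config` helper is a sketch, not code from this repo:

```python
import os

# Defaults mirror the environment-variable table above.
DEFAULTS = {
    "AWS_REGION": "us-east-1",
    "BASE_MODEL": "emilyalsentzer/Bio_ClinicalBERT",
    "MAX_SAMPLES": "10000",
    "EPOCHS": "3",
    "BATCH_SIZE": "16",
}
REQUIRED = ["RUNPOD_API_KEY", "RUNPOD_ENDPOINT",
            "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET"]

def load_config(env=None):
    """Merge defaults with the environment; fail fast on missing keys."""
    env = os.environ if env is None else env
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing required env vars: {', '.join(missing)}")
    cfg = {k: env.get(k, v) for k, v in DEFAULTS.items()}
    cfg.update({k: env[k] for k in REQUIRED})
    # Numeric settings arrive as strings; coerce them once, centrally.
    for key in ("MAX_SAMPLES", "EPOCHS", "BATCH_SIZE"):
        cfg[key] = int(cfg[key])
    return cfg
```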

Kubernetes Secrets (Recommended)

For production, use Kubernetes secrets instead of environment variables:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ml-pipeline-secrets
type: Opaque
stringData:
  RUNPOD_API_KEY: "your-key"
  AWS_ACCESS_KEY_ID: "your-key"
  AWS_SECRET_ACCESS_KEY: "your-secret"
```
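A pod can then inject every key of the secret as environment variables via `envFrom`. This is an illustrative pod-spec fragment (the container name is a placeholder), not a manifest from this repo:

```yaml
# Pod spec fragment: expose all keys of ml-pipeline-secrets as env vars
spec:
  containers:
    - name: trainer            # placeholder container name
      envFrom:
        - secretRef:
            name: ml-pipeline-secrets
```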

Supported Models

| Model | Type | Use Case |
|-------|------|----------|
| emilyalsentzer/Bio_ClinicalBERT | BERT | Classification tasks |
| meta-llama/Llama-3.1-8B-Instruct | LLM | Text generation (LoRA) |
| google/gemma-3-4b-it | LLM | Lightweight inference |

Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| task | ddi | Training task |
| model_name | Bio_ClinicalBERT | HuggingFace model ID |
| max_samples | 10000 | Training samples |
| epochs | 3 | Training epochs |
| batch_size | 16 | Batch size |
| eval_split | 0.1 | Validation split |
| s3_bucket | - | S3 bucket for output |
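Applying defaults and sanity-checking a parameter dict before submission can catch mistakes client-side. The defaults below mirror the table; the validation rules themselves and the `validate_params` name are assumptions, not taken from the handler:

```python
VALID_TASKS = {"ddi", "ade", "triage", "symptom_disease"}

def validate_params(params):
    """Apply table defaults and sanity-check a training parameter dict."""
    p = {"task": "ddi", "model_name": "emilyalsentzer/Bio_ClinicalBERT",
         "max_samples": 10000, "epochs": 3, "batch_size": 16,
         "eval_split": 0.1, **params}
    if p["task"] not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}")
    if not 0.0 < p["eval_split"] < 1.0:
        raise ValueError("eval_split must be in (0, 1)")
    if not p.get("s3_bucket"):
        raise ValueError("s3_bucket is required")
    if min(p["max_samples"], p["epochs"], p["batch_size"]) < 1:
        raise ValueError("max_samples, epochs, batch_size must be >= 1")
    return p
```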

Development

```bash
# Build container
cd components/runpod_trainer
docker build -t healthcare-trainer .

# Trigger CI build
gh workflow run build-trainer.yaml
```

License

MIT
