Medical AI Risk Evaluation Framework — A benchmark suite for evaluating safety, harmfulness, and groundedness of medical AI systems.
Maintainer: Jean-Philippe Corbeil (jcorbeil@microsoft.com)
MedRiskEval provides a unified framework to evaluate large language models across five medical safety benchmarks:
| Benchmark | ID | Description | Samples | Default Judge |
|---|---|---|---|---|
| PatientSafetyBench | psb | Patient-facing safety queries | 466 | gpt-4 |
| MedSafetyBench | msb | Clinician-facing ethical queries (9 AMA categories) | 450 | gpt-4 |
| JailbreakBench | jbb | Jailbreak resistance evaluation | 100 harmful + 100 benign | gpt-4-0806 |
| XSTest | xstest | Over-refusal / exaggerated safety behavior | 250 safe + 200 unsafe | gpt-4-0806 |
| FACTS-med | facts_med | Groundedness against reference documents | 219 | gpt-4 |
!THIS REPOSITORY IS FOR RESEARCH PURPOSES!
This benchmark and its outputs are research artefacts only. Results produced by MedRiskEval do not constitute a guarantee of safety, reliability, or fitness for any particular use case. A passing score does not imply that a model is safe for deployment in clinical or patient-facing settings.
Proper red-teaming, domain-expert review, and regulatory compliance assessments should still be carried out before deploying any language model in healthcare environments. MedRiskEval is intended to support — not replace — comprehensive safety evaluation processes.
The authors and contributors assume no liability for decisions made based on benchmark results.
pip install -e .

Core requirements (installed automatically):

- datasets>=2.0.0
- pydantic>=2.0.0
- typer>=0.9.0
- pyyaml>=6.0.0
- openai>=1.0.0
- httpx>=0.24.0
- rich>=13.0.0
- tqdm>=4.0.0

Optional:

- kagglehub: for automatic FACTS dataset download from Kaggle
- vllm: for local model serving (Linux + GPU only)
Some benchmarks require external data. Run the setup script to download and place everything:
python setup_datasets.py

| Dataset | Source | Action |
|---|---|---|
| PSB | HuggingFace: microsoft/PatientSafetyBench | Auto-downloaded and cached on first use. Pre-download optional. |
| MSB | GitHub: AI4LIFE-GROUP/med-safety-bench | Clones the repository into ./med-safety-bench/. Requires git. |
| JBB | HuggingFace: JailbreakBench/JBB-Behaviors | Auto-downloaded and cached on first use. Pre-download optional. |
| XSTest | HuggingFace: walledai/XSTest | Auto-downloaded and cached on first use. Pre-download optional. |
| FACTS-med | Kaggle: deepmind/facts-grounding-examples | Downloads via kagglehub, or searches common local paths. Places FACTS_examples.csv in the project root. |
# Full setup: clone MSB, download FACTS, pre-cache HF datasets
python setup_datasets.py
# Skip HuggingFace pre-download (they auto-download at runtime anyway)
python setup_datasets.py --skip-hf
# Provide FACTS CSV path manually
python setup_datasets.py --facts-csv ~/Downloads/FACTS_examples.csv

If you prefer to set up datasets manually:
- MSB: Clone the repository:

      git clone https://github.com/AI4LIFE-GROUP/med-safety-bench.git

- FACTS-med: Download FACTS_examples.csv from Kaggle and place it in the project root. Or install kagglehub and download the dataset with it:

      pip install kagglehub
      python -c "import kagglehub; print(kagglehub.dataset_download('deepmind/facts-grounding-examples'))"

- PSB, JBB, XSTest: No action needed. They download automatically from HuggingFace on first run.
medriskeval/
├── FACTS_examples.csv # FACTS-med data
├── med-safety-bench/ # MSB repository clone
│   └── datasets/test/gpt4/
│       ├── med_safety_demonstrations_category_1.csv
│       ├── ...
│       └── med_safety_demonstrations_category_9.csv
├── configs/
├── medriskeval/
├── runs/
├── cache/
└── setup_datasets.py
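Before a long run, it can help to confirm that the manually placed files are where the layout above expects them. The following is a minimal sketch, not part of MedRiskEval: the helper name `missing_dataset_files` is hypothetical, and it checks only the files named explicitly in the layout (the elided MSB category files are not checked).

```python
from pathlib import Path

# Paths are taken from the layout above; the helper itself is hypothetical.
MSB_DIR = Path("med-safety-bench") / "datasets" / "test" / "gpt4"

# Only files named explicitly in the layout are listed here.
EXPECTED_FILES = [
    Path("FACTS_examples.csv"),
    MSB_DIR / "med_safety_demonstrations_category_1.csv",
    MSB_DIR / "med_safety_demonstrations_category_9.csv",
]


def missing_dataset_files(project_root: str) -> list[str]:
    """Return the expected dataset files that are absent under project_root."""
    root = Path(project_root)
    return [p.as_posix() for p in EXPECTED_FILES if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files(".")
    print("All listed dataset files found." if not missing else "\n".join(missing))
```

An empty result only means the listed files exist; it does not validate their contents.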
MedRiskEval can be run via python -m medriskeval.cli.main or the medriskeval entry point (if installed with pip install -e .).
| Variable | Description | Required |
|---|---|---|
| AZURE_OPENAI_API_KEY | Azure OpenAI API key | For Azure provider |
| AZURE_OPENAI_BASE_URL | Azure OpenAI endpoint URL | For Azure provider |
| OPENAI_API_KEY | OpenAI API key | For OpenAI provider |
| MEDRISKEVAL_CACHE_DIR | Override default cache directory | No |
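A quick preflight check can catch a missing key before any API call is made. This is an illustrative sketch only: the variable names come from the table above, but the `REQUIRED_ENV` mapping and the `missing_env` helper are hypothetical, and the empty entry for vLLM is an assumption that this README does not confirm.

```python
import os

# Variable names come from the table above. The mapping is illustrative;
# the empty list for "vllm" is an assumption, not documented behavior.
REQUIRED_ENV = {
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_BASE_URL"],
    "openai": ["OPENAI_API_KEY"],
    "vllm": [],
}


def missing_env(provider: str, env=None) -> list[str]:
    """List required variables that are unset or empty for the given provider."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


if __name__ == "__main__":
    for provider in REQUIRED_ENV:
        print(provider, "missing:", missing_env(provider))
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.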
AZURE_OPENAI_API_KEY=<your-key> \
python -m medriskeval.cli.main run-config configs/quick_test.yaml

# Full evaluation across all benchmarks
AZURE_OPENAI_API_KEY=<your-key> \
python -m medriskeval.cli.main run-config configs/full_eval.yaml
# Dry run — show execution plan without running
python -m medriskeval.cli.main run-config configs/full_eval.yaml --dry-run
# Verbose output
AZURE_OPENAI_API_KEY=<your-key> \
python -m medriskeval.cli.main run-config configs/full_eval.yaml --verbose

# PSB with Azure model
AZURE_OPENAI_API_KEY=<your-key> \
python -m medriskeval.cli.main run psb azure:gpt-4.1-mini
# JBB with OpenAI model and custom judge
OPENAI_API_KEY=<your-key> \
python -m medriskeval.cli.main run jbb openai:gpt-4 --judge openai:gpt-4-0806
# Limit samples for quick testing
python -m medriskeval.cli.main run xstest openai:gpt-4.1-mini --max-samples 10

Start the vLLM server (Linux + GPU required):
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct --port 8000 --host 0.0.0.0

Then run the evaluation:
AZURE_OPENAI_API_KEY=<your-key> \
python -m medriskeval.cli.main run-config configs/vllm_azure_judge.yaml

The summarize command group has three subcommands:
# Table format (default)
python -m medriskeval.cli.main summarize summarize runs/psb/gpt-4.1-mini_20260410_181103
# Export as CSV
python -m medriskeval.cli.main summarize summarize runs/psb/gpt-4.1-mini_20260410_181103 --format csv -o results.csv
# Export as Markdown
python -m medriskeval.cli.main summarize summarize runs/psb/gpt-4.1-mini_20260410_181103 --format markdown
# Export as JSON
python -m medriskeval.cli.main summarize summarize runs/psb/gpt-4.1-mini_20260410_181103 --format json

Output formats: table (default), json, csv, markdown.
# List all runs
python -m medriskeval.cli.main summarize list-runs
# Filter by benchmark
python -m medriskeval.cli.main summarize list-runs --benchmark psb

python -m medriskeval.cli.main summarize compare runs/psb/run1 runs/psb/run2
# Export comparison as CSV or Markdown
python -m medriskeval.cli.main summarize compare runs/psb/run1 runs/psb/run2 --format csv -o comparison.csv
python -m medriskeval.cli.main summarize compare runs/psb/run1 runs/psb/run2 --format markdown

python -m medriskeval.cli.main list-tasks

# Tasks to run
tasks:
  - benchmark: psb                    # Required: psb | msb | jbb | xstest | facts_med
    split: test                       # Optional: dataset split (default: test)
    max_samples: 10                   # Optional: limit examples for quick testing

# Models to evaluate (each task runs against every model)
models:
  - provider: azure                   # azure | openai | vllm
    model_id: gpt-4.1-mini            # Model name / deployment name
    api_key: ${AZURE_OPENAI_API_KEY}  # Supports env var interpolation
    base_url: ${AZURE_OPENAI_BASE_URL:https://default.openai.azure.com/}
    api_version: "2025-01-01-preview" # Azure-specific
    timeout: 60.0                     # Request timeout in seconds
    generation:
      temperature: 0.0
      max_tokens: 256

# Judge model (evaluates model outputs)
judge:
  provider: azure
  model_id: gpt-4.1
  api_key: ${AZURE_OPENAI_API_KEY}
  base_url: ${AZURE_OPENAI_BASE_URL}
  generation:
    temperature: 0.0
    max_tokens: 512
  num_samples: 1                      # Number of judge calls for voting

# Output and cache directories
output_dir: ./runs
cache_dir: ./cache
verbose: false

YAML configs support ${VAR} and ${VAR:default} syntax:
- ${AZURE_OPENAI_API_KEY} (fails if not set)
- ${AZURE_OPENAI_API_KEY:} (empty string if not set)
- ${AZURE_OPENAI_BASE_URL:https://default.openai.azure.com/} (uses default if not set)
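The three cases above can be reproduced with a small regular-expression substitution. This is a minimal sketch of the described behavior, not MedRiskEval's actual loader: the `interpolate` function and its error type are assumptions chosen for illustration.

```python
import os
import re

# Matches ${VAR} and ${VAR:default}. The default may be empty and may itself
# contain colons (as in a URL); it ends at the first closing brace.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def interpolate(text: str, env=None) -> str:
    """Expand ${VAR} and ${VAR:default} references in text (hypothetical helper)."""
    env = os.environ if env is None else env

    def _sub(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:  # "${VAR:}" yields an empty-string default
            return default
        raise KeyError(f"Environment variable {name} is not set")

    return _PATTERN.sub(_sub, text)


if __name__ == "__main__":
    url = "${AZURE_OPENAI_BASE_URL:https://default.openai.azure.com/}"
    print(interpolate(url, env={}))
```

In this sketch a variable that is set to an empty string counts as set; whether the real loader treats that case the same way is not stated here.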
| Config | Description |
|---|---|
| configs/quick_test.yaml | Single PSB benchmark with Azure, for smoke testing |
| configs/full_eval.yaml | All 5 benchmarks with Azure model + Azure judge |
| configs/vllm_local.yaml | Local vLLM model with OpenAI judge |
| configs/vllm_azure_judge.yaml | Local vLLM model with Azure judge |
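Because each task runs against every model, a config with several tasks and several models expands into their cross product. The sketch below illustrates that expansion on an already-parsed config dict (for example, one produced by a YAML loader). The `execution_plan` helper and the sample values are hypothetical and only mirror the field names shown in the config schema.

```python
from itertools import product

# A config already parsed into a dict; values are examples, not a bundled config.
config = {
    "tasks": [{"benchmark": "psb", "max_samples": 10}, {"benchmark": "xstest"}],
    "models": [
        {"provider": "azure", "model_id": "gpt-4.1-mini"},
        {"provider": "vllm", "model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    ],
}


def execution_plan(cfg: dict) -> list[tuple[str, str]]:
    """Pair every task with every model as (benchmark, 'provider:model_id')."""
    return [
        (task["benchmark"], f"{model['provider']}:{model['model_id']}")
        for task, model in product(cfg["tasks"], cfg["models"])
    ]


for benchmark, model in execution_plan(config):
    print(f"{benchmark} -> {model}")
```

Two tasks and two models give four runs here, which is a useful way to estimate cost before launching a full evaluation.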
medriskeval/
├── medriskeval/ # Main package
│ ├── cli/ # Command-line interface
│ ├── config/ # Configuration schemas and YAML loading
│ ├── core/ # Core types (Example, JudgmentResult)
│ ├── datasets/ # Dataset adapters (PSB, MSB, JBB, XSTest, FACTS)
│ ├── metrics/ # Metric computation (refusal, groundedness)
│ ├── models/ # Model backends (OpenAI, Azure, vLLM)
│ ├── prompts/ # Judge prompt builders
│ ├── reporting/ # Result summarization and export
│ └── runner/ # Evaluation pipeline and task orchestration
├── configs/ # YAML configuration files
├── runs/ # Evaluation output directory
├── cache/ # Response and judgment cache
├── setup_datasets.py # Dataset download and setup script
├── pyproject.toml
└── requirements.txt
If you use MedRiskEval in your research, please cite:
@inproceedings{corbeil-etal-2026-medriskeval,
title = "{M}ed{R}isk{E}val: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings",
author = "Corbeil, Jean-Philippe and Kim, Minseon and Griot, Maxime and Agarwal, Sheela and Sordoni, Alessandro and Beaulieu, Francois and Vozila, Paul",
booktitle = "Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.eacl-industry.39/",
doi = "10.18653/v1/2026.eacl-industry.39",
pages = "513--524",
}