Add open-i-summarization environment for radiology report summarization#102

Open
jbdel wants to merge 3 commits into MedARC-AI:main from jbdel:open-i-summarization

Conversation

Contributor

@jbdel jbdel commented Jan 22, 2026

Summary

Evaluation Approach

  • LLM-as-Judge: Multi-dimensional scoring (accuracy, completeness, conciseness) on 1-5 scale, following the medicationqa environment pattern
  • Automatic Metrics: BLEU, ROUGE (1/2/L/Lsum), BERTScore - same metrics used in the Nature Medicine paper
  • Prompt: Taken from the Nature Medicine paper for fair comparison
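As a sanity check on the reported numbers, the ROUGE-1 figure in the example results below can be reproduced with a minimal unigram-overlap F1. This is a hedged sketch for illustration only; the environment itself presumably computes these metrics with a standard library rather than this hand-rolled helper:

```python
import re
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram overlap clipped by reference counts."""
    tok = lambda s: re.findall(r"[a-z]+", s.lower())
    ref, cand = Counter(tok(reference)), Counter(tok(candidate))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

ref = "No acute cardiopulmonary findings"
out = ("Normal heart size; no acute pulmonary findings; mediastinal "
       "calcification and dense right upper lung nodule consistent "
       "with prior granulomatous disease.")
# Overlap {no, acute, findings} = 3 of 20 candidate / 4 reference tokens,
# giving precision 0.15, recall 0.75, F1 0.25 -- matching the 0.2500 below.
print(round(rouge1_f1(ref, out), 4))  # 0.25
```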

Files Added

  • environments/open_i_summarization/open_i_summarization.py
  • environments/open_i_summarization/pyproject.toml
  • environments/open_i_summarization/README.md
  • configs/envs/open_i_summarization.yaml
  • tests/test_open_i_summarization.py (19 tests, all passing)

Example Usage & Output

python -m medarc_verifiers.cli.main open_i_summarization -m gpt-4.1-mini -n 5 -r 1 --judge-model gpt-4.1-mini -s
--- Evaluation ---
Environment: open_i_summarization
Model: gpt-4.1-mini
Examples: 5
--- Example ---
╭──────────────────────────────────── Step 0 ────────────────────────────────────╮
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ Prompt                         ┃ Completion                      ┃ Reward ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩  │
│ │ system: Summarize the          │ assistant: Normal heart size;   │   1.00 │  │
│ │ radiology report findings into │ no acute pulmonary findings;    │        │  │
│ │ an impression with minimal     │ mediastinal calcification and   │        │  │
│ │ text.                          │ dense right upper lung nodule   │        │  │
│ │                                │ consistent with prior           │        │  │
│ │ user: Heart size within normal │ granulomatous disease.          │        │  │
│ │ limits...                      │                                 │        │  │
│ └────────────────────────────────┴─────────────────────────────────┴────────┘  │
╰────────────────────────────────────────────────────────────────────────────────╯
--- All ---
Rewards:
reward: avg - 1.000, std - 0.000
r1: [1.0, 1.0, 1.0, 1.0, 1.0]

Example Results

=== Example 1 ===
Model output: Normal heart size; no acute pulmonary findings; mediastinal 
              calcification and dense right upper lung nodule consistent 
              with prior granulomatous disease.
Reference: No acute cardiopulmonary findings
Reward: 1.0
LLM-as-Judge Scores:
  accuracy: 5/5
  completeness: 5/5
  conciseness: 5/5
Automatic Metrics:
  BLEU: 0.0000
  ROUGE-1: 0.2500
  ROUGE-L: 0.2500
  BERTScore F1: 0.8507

Implements a new evaluation environment for summarizing radiology findings
into impressions, based on the Open-I chest X-ray dataset.

Features:
- LLM-as-judge evaluation with multi-dimensional scoring (accuracy,
  completeness, conciseness) following the medicationqa pattern
- Automatic NLG metrics: BLEU, ROUGE (1/2/L/Lsum), BERTScore
- Dataset uploaded to HuggingFace: medarc/open-i-summarization
- Metrics aligned with Van Veen et al., Nature Medicine 2024

Files added:
- environments/open_i_summarization/open_i_summarization.py
- environments/open_i_summarization/pyproject.toml
- environments/open_i_summarization/README.md
- configs/envs/open_i_summarization.yaml
- tests/test_open_i_summarization.py (19 tests)
Contributor Author

jbdel commented Jan 22, 2026

Quick observation: the LLM-as-Judge may be too lenient for summarization.

After testing, I noticed the LLM-as-judge consistently scores 5/5 across all dimensions even when the model output is significantly longer than the reference.

Example

|  | Reference | Model output |
| --- | --- | --- |
| Text | "No acute cardiopulmonary findings" | "Normal heart size; no consolidation, effusion, or edema. Mediastinal calcification and dense right upper lung nodule indicate prior granulomatous disease." |
| Length | 33 chars | 154 chars (4.7x longer) |
| Judge scores |  | accuracy 5/5, completeness 5/5, conciseness 5/5 |

The judge prompt was adapted from medicationqa, which evaluates open-ended Q&A. In Q&A tasks, "conciseness" means "answer directly without rambling." However, summarization has the different goal of compressing information to minimal text.

We could adjust the conciseness criteria to penalize outputs significantly longer than the reference, or add an explicit length comparison to the judge prompt.

@warner-benjamin
Collaborator

We are only using MedHELM judge prompts for datasets with MedHELM implementations, so ideally we'd create a summarization prompt/rubric, or modify an existing one, for this medical task.

- Rename "accuracy" to "correctness" per Nature Medicine terminology
- Update judge dimensions to match reader study criteria from Methods section:
  - Correctness: evaluates precision (penalizes fabricated information)
  - Completeness: evaluates recall (clinically important detail retained)
  - Conciseness: evaluates brevity (penalizes superfluous information)
- Add explicit length-based scoring guidelines for conciseness
- Update tests and README to reflect new dimension names
- Add example results showing stricter conciseness scoring

Reference: Van Veen et al., Nature Medicine 2024
https://doi.org/10.1038/s41591-024-02855-5 (Methods - Reader study)
Contributor Author

jbdel commented Jan 22, 2026

Thanks for the feedback @warner-benjamin

I've updated the judge criteria to match the reader study methodology from Van Veen et al., Nature Medicine 2024.

Renamed dimension: accuracy → correctness (per paper terminology)

Updated criteria definitions (from Methods - Reader study section):

| Dimension | Definition |
| --- | --- |
| Correctness | "Which summary includes less false information?" Evaluates precision (penalizes fabricated information) |
| Completeness | "Which summary more completely captures important information?" Evaluates recall (clinically important detail retained) |
| Conciseness | "Which summary contains less non-important information?" Evaluates brevity (penalizes superfluous information) |

Added explicit length-based scoring guidelines for conciseness:

  • 5: Similar length or shorter than reference
  • 4: Up to 1.5x longer with clinically relevant content
  • 3: 1.5-2x longer with some unnecessary detail
  • 2: 2-3x longer with substantial superfluous information
  • 1: >3x longer or much irrelevant content
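Assuming these thresholds are applied as hard cutoffs on the output/reference character-length ratio, the mapping can be sketched as a small helper (the function name is hypothetical; in the environment this logic lives in the judge rubric rather than standalone code):

```python
def conciseness_score(ref_chars: int, out_chars: int) -> int:
    """Map output/reference length ratio to a 1-5 conciseness score,
    per the thresholds above: 5 <=1.0x, 4 <=1.5x, 3 <=2.0x, 2 <=3.0x, 1 >3.0x."""
    ratio = out_chars / ref_chars
    if ratio <= 1.0:
        return 5
    if ratio <= 1.5:
        return 4
    if ratio <= 2.0:
        return 3
    if ratio <= 3.0:
        return 2
    return 1

# The three before/after examples: 154/33 = 4.7x, 106/50 = 2.1x, 86/47 = 1.8x
print([conciseness_score(33, 154),
       conciseness_score(50, 106),
       conciseness_score(47, 86)])  # [1, 2, 3]
```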

Before vs After

| Example | Length ratio | Old score | New score |
| --- | --- | --- | --- |
| 1 | 4.7x | 5/5 conciseness | 1/5 conciseness |
| 2 | 2.1x | 5/5 conciseness | 2/5 conciseness |
| 3 | 1.8x | 5/5 conciseness | 3/5 conciseness |

Example Output (New Judge)

=== Example 1 ===
Reference: "No acute cardiopulmonary findings" (33 chars)
Model: "Normal heart size; no consolidation, effusion, or edema..." (154 chars)
Length ratio: 4.67x
Reward: 0.733
correctness: 5/5 (no false information)
completeness: 5/5 (all key findings captured)
conciseness: 1/5 (4.7x longer than reference, exceeds >3x threshold)
=== Example 2 ===
Reference: "Opacification of the right middle and lower lobes." (50 chars)
Model: "Right middle and lower lobe opacity; mediastinal contours normal..." (106 chars)
Length ratio: 2.12x
Reward: 0.800
correctness: 5/5
completeness: 5/5
conciseness: 2/5 (2.1x falls in 2.0-3.0x range)
=== Example 3 ===
Reference: "1. Negative for acute cardiopulmonary findings." (47 chars)
Model: "No acute pulmonary abnormalities. Normal cardiomediastinal silhouette..." (86 chars)
Length ratio: 1.83x
Reward: 0.867
correctness: 5/5
completeness: 5/5
conciseness: 3/5 (1.8x falls in 1.5-2.0x range)
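The rewards above (0.733, 0.800, 0.867) are consistent with averaging the three 1-5 judge scores and rescaling to [0, 1]. A minimal sketch of that assumed aggregation (not confirmed in the diff):

```python
def overall_reward(correctness: int, completeness: int, conciseness: int) -> float:
    """Assumed aggregation: mean of the three 1-5 judge scores, rescaled to [0, 1]."""
    return (correctness + completeness + conciseness) / 15

# The three examples above: (5+5+1)/15, (5+5+2)/15, (5+5+3)/15
print([round(overall_reward(5, 5, c), 3) for c in (1, 2, 3)])  # [0.733, 0.8, 0.867]
```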

- Compute length ratio programmatically and inject into judge prompt
- Add explicit scoring thresholds that judge must follow:
  - 5: ≤1.0x, 4: 1.0-1.5x, 3: 1.5-2.0x, 2: 2.0-3.0x, 1: >3.0x
- Store length_metrics in info for analysis
- Add strong instruction to prevent content-quality override

This ensures conciseness scores strictly follow length ratios.
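The steps above could look roughly like the following sketch. All names here are hypothetical (including `build_judge_prompt` and the metrics dict keys); it only illustrates computing the ratio programmatically, injecting it with the threshold table into the judge prompt, and returning length metrics for storage in `info`:

```python
def build_judge_prompt(reference: str, output: str) -> tuple[str, dict]:
    """Compute the length ratio and inject it, with hard scoring
    thresholds, into the judge prompt (illustrative sketch only)."""
    ratio = len(output) / max(len(reference), 1)
    length_metrics = {
        "ref_chars": len(reference),
        "out_chars": len(output),
        "length_ratio": round(ratio, 2),
    }
    prompt = (
        f"Length ratio (output/reference): {length_metrics['length_ratio']}x.\n"
        "Score conciseness strictly by this ratio: 5 if <=1.0x, 4 if <=1.5x, "
        "3 if <=2.0x, 2 if <=3.0x, 1 if >3.0x. Content quality must NOT "
        "override the length-based conciseness score.\n"
        f"Reference impression: {reference}\n"
        f"Model impression: {output}\n"
    )
    return prompt, length_metrics

prompt, metrics = build_judge_prompt("No acute cardiopulmonary findings", "x" * 154)
print(metrics["length_ratio"])  # 4.67
```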