Add open-i-summarization environment for radiology report summarization#102

Open
jbdel wants to merge 3 commits into MedARC-AI:main from jbdel:open-i-summarization

Conversation

Contributor

@jbdel jbdel commented Jan 22, 2026

Summary

Evaluation Approach

  • LLM-as-Judge: Multi-dimensional scoring (accuracy, completeness, conciseness) on 1-5 scale, following the medicationqa environment pattern
  • Automatic Metrics: BLEU, ROUGE (1/2/L/Lsum), BERTScore - same metrics used in the Nature Medicine paper
  • Prompt: Taken from the Nature Medicine paper for fair comparison
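As a sanity check on the reported numbers, the ROUGE-1 figure in the example results below can be reproduced with a minimal unigram-overlap F1. This is a hedged sketch for illustration only; the environment itself presumably computes these metrics with a standard library rather than this hand-rolled helper:

```python
import re
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram overlap clipped by reference counts."""
    tok = lambda s: re.findall(r"[a-z]+", s.lower())
    ref, cand = Counter(tok(reference)), Counter(tok(candidate))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

ref = "No acute cardiopulmonary findings"
out = ("Normal heart size; no acute pulmonary findings; mediastinal "
       "calcification and dense right upper lung nodule consistent "
       "with prior granulomatous disease.")
# Overlap {no, acute, findings} = 3 of 20 candidate / 4 reference tokens,
# giving precision 0.15, recall 0.75, F1 0.25 -- matching the 0.2500 below.
print(round(rouge1_f1(ref, out), 4))  # 0.25
```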

Files Added

  • environments/open_i_summarization/open_i_summarization.py
  • environments/open_i_summarization/pyproject.toml
  • environments/open_i_summarization/README.md
  • configs/envs/open_i_summarization.yaml
  • tests/test_open_i_summarization.py (19 tests, all passing)

Example Usage & Output

python -m medarc_verifiers.cli.main open_i_summarization -m gpt-4.1-mini -n 5 -r 1 --judge-model gpt-4.1-mini -s
--- Evaluation ---
Environment: open_i_summarization
Model: gpt-4.1-mini
Examples: 5
--- Example ---
╭──────────────────────────────────── Step 0 ────────────────────────────────────╮
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ Prompt                         ┃ Completion                      ┃ Reward ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩  │
│ │ system: Summarize the          │ assistant: Normal heart size;   │   1.00 │  │
│ │ radiology report findings into │ no acute pulmonary findings;    │        │  │
│ │ an impression with minimal     │ mediastinal calcification and   │        │  │
│ │ text.                          │ dense right upper lung nodule   │        │  │
│ │                                │ consistent with prior           │        │  │
│ │ user: Heart size within normal │ granulomatous disease.          │        │  │
│ │ limits...                      │                                 │        │  │
│ └────────────────────────────────┴─────────────────────────────────┴────────┘  │
╰────────────────────────────────────────────────────────────────────────────────╯
--- All ---
Rewards:
reward: avg - 1.000, std - 0.000
r1: [1.0, 1.0, 1.0, 1.0, 1.0]

Example Results

=== Example 1 ===
Model output: Normal heart size; no acute pulmonary findings; mediastinal 
              calcification and dense right upper lung nodule consistent 
              with prior granulomatous disease.
Reference: No acute cardiopulmonary findings
Reward: 1.0
LLM-as-Judge Scores:
  accuracy: 5/5
  completeness: 5/5
  conciseness: 5/5
Automatic Metrics:
  BLEU: 0.0000
  ROUGE-1: 0.2500
  ROUGE-L: 0.2500
  BERTScore F1: 0.8507

Implements a new evaluation environment for summarizing radiology findings
into impressions, based on the Open-I chest X-ray dataset.

Features:
- LLM-as-judge evaluation with multi-dimensional scoring (accuracy,
  completeness, conciseness) following the medicationqa pattern
- Automatic NLG metrics: BLEU, ROUGE (1/2/L/Lsum), BERTScore
- Dataset uploaded to HuggingFace: medarc/open-i-summarization
- Metrics aligned with Van Veen et al., Nature Medicine 2024

Files added:
- environments/open_i_summarization/open_i_summarization.py
- environments/open_i_summarization/pyproject.toml
- environments/open_i_summarization/README.md
- configs/envs/open_i_summarization.yaml
- tests/test_open_i_summarization.py (19 tests)
Contributor Author

jbdel commented Jan 22, 2026

Quick observation: the LLM-as-Judge may be too lenient for summarization.

After testing, I noticed the LLM-as-judge consistently scores 5/5 across all dimensions even when the model output is significantly longer than the reference.

Example

|  | Reference | Model output |
| --- | --- | --- |
| Text | "No acute cardiopulmonary findings" | "Normal heart size; no consolidation, effusion, or edema. Mediastinal calcification and dense right upper lung nodule indicate prior granulomatous disease." |
| Length | 33 chars | 154 chars (4.7x longer) |
| Judge scores |  | accuracy 5/5, completeness 5/5, conciseness 5/5 |

The judge prompt was adapted from medicationqa, which evaluates open-ended Q&A. In Q&A tasks, "conciseness" means "answer directly without rambling." However, summarization has the different goal of compressing information to minimal text.

We could adjust the conciseness criteria to penalize outputs significantly longer than the reference, or add an explicit length comparison to the judge prompt.

@warner-benjamin
Collaborator

We are only using MedHELM judge prompts for datasets with MedHELM implementations, so ideally we'd create a summarization prompt/rubric, or modify an existing one, for this medical task.

- Rename "accuracy" to "correctness" per Nature Medicine terminology
- Update judge dimensions to match reader study criteria from Methods section:
  - Correctness: evaluates precision (penalizes fabricated information)
  - Completeness: evaluates recall (clinically important detail retained)
  - Conciseness: evaluates brevity (penalizes superfluous information)
- Add explicit length-based scoring guidelines for conciseness
- Update tests and README to reflect new dimension names
- Add example results showing stricter conciseness scoring

Reference: Van Veen et al., Nature Medicine 2024
https://doi.org/10.1038/s41591-024-02855-5 (Methods - Reader study)
Contributor Author

jbdel commented Jan 22, 2026

Thanks for the feedback @warner-benjamin

I've updated the judge criteria to match the reader study methodology from Van Veen et al., Nature Medicine 2024.

Renamed dimension: accuracy → correctness (per paper terminology)

Updated criteria definitions (from Methods - Reader study section):

| Dimension | Definition |
| --- | --- |
| Correctness | "Which summary includes less false information?" Evaluates precision (penalizes fabricated information) |
| Completeness | "Which summary more completely captures important information?" Evaluates recall (clinically important detail retained) |
| Conciseness | "Which summary contains less non-important information?" Evaluates brevity (penalizes superfluous information) |

Added explicit length-based scoring guidelines for conciseness:

  • 5: Similar length or shorter than reference
  • 4: Up to 1.5x longer with clinically relevant content
  • 3: 1.5-2x longer with some unnecessary detail
  • 2: 2-3x longer with substantial superfluous information
  • 1: >3x longer or much irrelevant content
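Assuming these thresholds are applied as hard cutoffs on the output/reference character-length ratio, the mapping can be sketched as a small helper (the function name is hypothetical; in the environment this logic lives in the judge rubric rather than standalone code):

```python
def conciseness_score(ref_chars: int, out_chars: int) -> int:
    """Map output/reference length ratio to a 1-5 conciseness score,
    per the thresholds above: 5 <=1.0x, 4 <=1.5x, 3 <=2.0x, 2 <=3.0x, 1 >3.0x."""
    ratio = out_chars / ref_chars
    if ratio <= 1.0:
        return 5
    if ratio <= 1.5:
        return 4
    if ratio <= 2.0:
        return 3
    if ratio <= 3.0:
        return 2
    return 1

# The three before/after examples: 154/33 = 4.7x, 106/50 = 2.1x, 86/47 = 1.8x
print([conciseness_score(33, 154),
       conciseness_score(50, 106),
       conciseness_score(47, 86)])  # [1, 2, 3]
```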

Before vs After

| Example | Length ratio | Old score | New score |
| --- | --- | --- | --- |
| 1 | 4.7x | 5/5 conciseness | 1/5 conciseness |
| 2 | 2.1x | 5/5 conciseness | 2/5 conciseness |
| 3 | 1.8x | 5/5 conciseness | 3/5 conciseness |

Example Output (New Judge)

=== Example 1 ===
Reference: "No acute cardiopulmonary findings" (33 chars)
Model: "Normal heart size; no consolidation, effusion, or edema..." (154 chars)
Length ratio: 4.67x
Reward: 0.733
correctness: 5/5 (no false information)
completeness: 5/5 (all key findings captured)
conciseness: 1/5 (4.7x longer than reference, exceeds >3x threshold)
=== Example 2 ===
Reference: "Opacification of the right middle and lower lobes." (50 chars)
Model: "Right middle and lower lobe opacity; mediastinal contours normal..." (106 chars)
Length ratio: 2.12x
Reward: 0.800
correctness: 5/5
completeness: 5/5
conciseness: 2/5 (2.1x falls in 2.0-3.0x range)
=== Example 3 ===
Reference: "1. Negative for acute cardiopulmonary findings." (47 chars)
Model: "No acute pulmonary abnormalities. Normal cardiomediastinal silhouette..." (86 chars)
Length ratio: 1.83x
Reward: 0.867
correctness: 5/5
completeness: 5/5
conciseness: 3/5 (1.8x falls in 1.5-2.0x range)
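The rewards above (0.733, 0.800, 0.867) are consistent with averaging the three 1-5 judge scores and rescaling to [0, 1]. A minimal sketch of that assumed aggregation (not confirmed in the diff):

```python
def overall_reward(correctness: int, completeness: int, conciseness: int) -> float:
    """Assumed aggregation: mean of the three 1-5 judge scores, rescaled to [0, 1]."""
    return (correctness + completeness + conciseness) / 15

# The three examples above: (5+5+1)/15, (5+5+2)/15, (5+5+3)/15
print([round(overall_reward(5, 5, c), 3) for c in (1, 2, 3)])  # [0.733, 0.8, 0.867]
```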

- Compute length ratio programmatically and inject into judge prompt
- Add explicit scoring thresholds that judge must follow:
  - 5: ≤1.0x, 4: 1.0-1.5x, 3: 1.5-2.0x, 2: 2.0-3.0x, 1: >3.0x
- Store length_metrics in info for analysis
- Add strong instruction to prevent content-quality override

This ensures conciseness scores strictly follow length ratios.
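The steps above could look roughly like the following sketch. All names here are hypothetical (including `build_judge_prompt` and the metrics dict keys); it only illustrates computing the ratio programmatically, injecting it with the threshold table into the judge prompt, and returning length metrics for storage in `info`:

```python
def build_judge_prompt(reference: str, output: str) -> tuple[str, dict]:
    """Compute the length ratio and inject it, with hard scoring
    thresholds, into the judge prompt (illustrative sketch only)."""
    ratio = len(output) / max(len(reference), 1)
    length_metrics = {
        "ref_chars": len(reference),
        "out_chars": len(output),
        "length_ratio": round(ratio, 2),
    }
    prompt = (
        f"Length ratio (output/reference): {length_metrics['length_ratio']}x.\n"
        "Score conciseness strictly by this ratio: 5 if <=1.0x, 4 if <=1.5x, "
        "3 if <=2.0x, 2 if <=3.0x, 1 if >3.0x. Content quality must NOT "
        "override the length-based conciseness score.\n"
        f"Reference impression: {reference}\n"
        f"Model impression: {output}\n"
    )
    return prompt, length_metrics

prompt, metrics = build_judge_prompt("No acute cardiopulmonary findings", "x" * 154)
print(metrics["length_ratio"])  # 4.67
```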