Add open-i-summarization environment for radiology report summarization #102
jbdel wants to merge 3 commits into MedARC-AI:main from
Conversation
Implements a new evaluation environment for summarizing radiology findings into impressions, based on the Open-I chest X-ray dataset.

Features:
- LLM-as-judge evaluation with multi-dimensional scoring (accuracy, completeness, conciseness) following the medicationqa pattern
- Automatic NLG metrics: BLEU, ROUGE (1/2/L/Lsum), BERTScore
- Dataset uploaded to HuggingFace: medarc/open-i-summarization
- Metrics aligned with Van Veen et al., Nature Medicine 2024

Files added:
- environments/open_i_summarization/open_i_summarization.py
- environments/open_i_summarization/pyproject.toml
- environments/open_i_summarization/README.md
- configs/envs/open_i_summarization.yaml
- tests/test_open_i_summarization.py (19 tests)
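As a rough illustration of the automatic NLG metrics above, here is a minimal pure-Python sketch of unigram-overlap ROUGE-1 F1; the actual environment presumably uses standard BLEU/ROUGE/BERTScore implementations, and the function name here is hypothetical.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    # Minimal ROUGE-1 F1: harmonic mean of unigram precision and recall.
    # Illustrative only; not the environment's actual metric code.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```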
Quick observation: the LLM-as-judge may be too lenient for summarization. After testing, I noticed it consistently scores 5/5 across all dimensions even when the model output is significantly longer than the reference. Example:

The judge prompt was adapted from the medicationqa environment. We could adjust the conciseness criteria to penalize outputs significantly longer than the reference, or add an explicit length comparison to the judge prompt.
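One way to add the explicit length comparison is to compute it before calling the judge and embed it in the prompt. A minimal sketch, assuming a hypothetical `build_judge_prompt` helper (the prompt wording and dimension names are illustrative, not the environment's actual prompt):

```python
def build_judge_prompt(findings: str, model_summary: str, reference: str) -> str:
    # Compute the word-length ratio up front so the judge does not have
    # to estimate verbosity itself (hypothetical helper, for illustration).
    ratio = len(model_summary.split()) / max(len(reference.split()), 1)
    return (
        "Evaluate the candidate impression against the reference.\n"
        f"Findings: {findings}\n"
        f"Reference impression: {reference}\n"
        f"Candidate impression: {model_summary}\n"
        f"Length ratio (candidate/reference): {ratio:.2f}\n"
        "Score accuracy, completeness, and conciseness from 1 to 5. "
        "Penalize conciseness when the candidate is significantly "
        "longer than the reference."
    )
```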
We are only using MedHELM judge prompts for datasets with MedHELM implementations, so ideally we'd create a summarization prompt/rubric, or modify an existing one, for this medical task.
- Rename "accuracy" to "correctness" per Nature Medicine terminology
- Update judge dimensions to match reader study criteria from the Methods section:
  - Correctness: evaluates precision (penalizes fabricated information)
  - Completeness: evaluates recall (clinically important detail retained)
  - Conciseness: evaluates brevity (penalizes superfluous information)
- Add explicit length-based scoring guidelines for conciseness
- Update tests and README to reflect new dimension names
- Add example results showing stricter conciseness scoring

Reference: Van Veen et al., Nature Medicine 2024, https://doi.org/10.1038/s41591-024-02855-5 (Methods - Reader study)
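For clarity, the three 1-5 dimension scores could be combined into a single reward like this. This is a hypothetical aggregation sketch (a normalized average); the environment's actual reward combination may differ.

```python
def aggregate_judge_scores(correctness: int, completeness: int, conciseness: int) -> float:
    # Average the three 1-5 judge dimensions and normalize to [0, 1].
    # Hypothetical helper; the real environment may weight dimensions differently.
    for score in (correctness, completeness, conciseness):
        assert 1 <= score <= 5, "each dimension is scored 1-5"
    return (correctness + completeness + conciseness) / 3 / 5
```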
Thanks for the feedback @warner-benjamin! I've updated the judge criteria to match the reader study methodology of Van Veen et al., Nature Medicine 2024.

Renamed dimension: accuracy → correctness.

Updated criteria definitions (from the Methods - Reader study section):
Added explicit length-based scoring guidelines for conciseness:
Before vs After
Example Output (New Judge)
- Compute length ratio programmatically and inject into judge prompt
- Add explicit scoring thresholds that the judge must follow:
  - 5: ≤1.0x, 4: 1.0-1.5x, 3: 1.5-2.0x, 2: 2.0-3.0x, 1: >3.0x
- Store length_metrics in info for analysis
- Add strong instruction to prevent content-quality override

This ensures conciseness scores strictly follow length ratios.
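The threshold table above maps directly to a small helper. A minimal sketch, assuming word-count lengths and a hypothetical function name:

```python
def conciseness_score(output_len: int, reference_len: int) -> int:
    # Map the output/reference length ratio to a 1-5 conciseness score
    # using the thresholds listed above (hypothetical helper name).
    ratio = output_len / max(reference_len, 1)
    if ratio <= 1.0:
        return 5
    if ratio <= 1.5:
        return 4
    if ratio <= 2.0:
        return 3
    if ratio <= 3.0:
        return 2
    return 1
```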
Summary
Evaluation Approach
Follows the medicationqa environment pattern.

Files Added

- environments/open_i_summarization/open_i_summarization.py
- environments/open_i_summarization/pyproject.toml
- environments/open_i_summarization/README.md
- configs/envs/open_i_summarization.yaml
- tests/test_open_i_summarization.py (19 tests, all passing)

Example Usage & Output
Example Results