CareQA env #33 #48
Merged
Commits (14):
- `8201912` Add careqa mcq eval environment (Arya-Hari)
- `9868f31` Merge branch 'main' of https://github.com/MedARC-AI/med-lm-envs (Arya-Hari)
- `a19ba97` add careqa open-ended env (Arya-Hari)
- `ec93a0f` add careqa open-ended env (Arya-Hari)
- `61eed34` resolving issues (Arya-Hari)
- `f311014` removing redundant imports (Arya-Hari)
- `f9f321a` resolving comments (Arya-Hari)
- `8a04ba9` resolving commits (Arya-Hari)
- `f0018ab` resolving comments (Arya-Hari)
- `b5da66a` Update careqa_openended.py (Arya-Hari)
- `013b0d6` Update careqa_openended.py (Arya-Hari)
- `d76f860` Update careqa_openended.py (Arya-Hari)
- `f412b30` Merge branch 'main' into pr/Arya-Hari/48 (warner-benjamin)
- `3682c60` update careqa to use No Free Labels style prompt (warner-benjamin)
# careqa

Evaluation environment for the [HPAI-BSC/CareQA](https://huggingface.co/datasets/HPAI-BSC/CareQA) multiple-choice dataset.

### Overview
- **Environment ID**: `careqa_mcq`
- **Short description**: CareQA is a healthcare QA dataset with **multiple-choice** and **open-ended clinical reasoning questions**. This environment covers the MCQs only.
- **Tags**: healthcare, medical QA, clinical reasoning, MCQ, single-turn

### Datasets
- **Primary dataset(s)**:
  - `CareQA_en` – multiple-choice clinical questions with 4 options and correct answer labels.
- **Source links**:
  - [Hugging Face CareQA dataset](https://huggingface.co/datasets/HPAI-BSC/CareQA)

### Task
- **Type**: single-turn
- **Parser**: custom prompt mapping (no structured markup)
- **Rubric overview**:
  **MCQ (`closed_mcq`)**: `vf.Rubric()` measuring **accuracy** (letter match).

### Quickstart
Run an evaluation with default settings:

```bash
uv run vf-eval careqa_mcq
```

Configure model and sampling:

```bash
uv run vf-eval careqa_mcq --model gpt-4.1-mini --num-examples 3 -s
```

### Metrics

| Metric | Meaning |
|---|---|
| `reward` | Main scalar reward (weighted sum of rubric criteria) |
| `accuracy` | Exact match on the target MCQ answer (letter A–D) |
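The letter-match accuracy above can be sketched in isolation. This is a minimal, self-contained approximation of the environment's scoring logic (the function names here are illustrative, not the environment's API):

```python
from typing import Optional


def extract_first_letter(text: str) -> Optional[str]:
    """Return the first A-Z letter in the uppercased model output, or None."""
    for ch in (text or "").upper():
        if "A" <= ch <= "Z":
            return ch
    return None


def letter_match_accuracy(model_output: str, gold_letter: str) -> float:
    """1.0 if the first letter found in the output matches the gold letter."""
    return 1.0 if extract_first_letter(model_output) == gold_letter.upper() else 0.0
```

Note the trade-off in first-letter parsing: a terse reply like `"B."` scores correctly, but a prose reply such as `"The answer is B"` is scored on its first letter (`T`), which is why the prompt instructs the model to respond with the option letter only.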
**careqa_mcq.py**

```python
from __future__ import annotations

from typing import Any, Optional

from datasets import load_dataset
import verifiers as vf


# Helper functions

def _get_text_from_completion(completion: Any) -> str:
    """Extract plain text from a completion (string or chat-message list)."""
    if isinstance(completion, str):
        return completion.strip()
    if isinstance(completion, list) and completion:
        last = completion[-1]
        if isinstance(last, dict):
            return str(last.get("content", "")).strip()
        return str(last).strip()
    return str(completion).strip()


def _first_letter(text: str) -> Optional[str]:
    """Extract the first uppercase A-Z letter."""
    for ch in (text or "").upper():
        if "A" <= ch <= "Z":
            return ch
    return None


# Prompt construction

def _build_prompt(question: str, options: dict[str, str]) -> str:
    """Create an MCQ prompt."""
    formatted_opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    letters = ", ".join(options.keys())
    return (
        "You are a board-certified clinician taking a medical reasoning test.\n"
        "Read the following question carefully and choose the most appropriate answer.\n\n"
        f"Question:\n{question.strip()}\n\n"
        f"Options:\n{formatted_opts}\n\n"
        f"Respond with only the option letter ({letters}), nothing else."
    )


# Main environment

def load_environment(split: str = "test") -> vf.Environment:
    """
    CareQA multiple-choice evaluation environment.
    Uses vf.SingleTurnEnv with an MCQ accuracy rubric.
    """
    ds = load_dataset("HPAI-BSC/CareQA", "CareQA_en", split=split)

    def _map(ex):
        options = {"A": ex["op1"], "B": ex["op2"], "C": ex["op3"], "D": ex["op4"]}
        gold_letter = ["A", "B", "C", "D"][ex["cop"] - 1]
        # Format the prompt as a list of chat messages (ChatML format).
        return {
            "prompt": [
                {"role": "user", "content": _build_prompt(ex["question"], options)}
            ],
            "answer": gold_letter,
        }

    mapped = ds.map(_map, remove_columns=ds.column_names)

    def mcq_accuracy(completion, answer):
        pred = _first_letter(_get_text_from_completion(completion))
        return 1.0 if pred == str(answer).upper() else 0.0

    rubric = vf.Rubric(funcs=[mcq_accuracy], weights=[1.0])

    return vf.SingleTurnEnv(
        dataset=mapped,
        eval_dataset=mapped,
        rubric=rubric,
        system_prompt=None,
    )
```
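The `_map` transformation can be exercised on a hand-made row. This sketch inlines the prompt builder and the 1-indexed `cop`-to-letter mapping so it runs without `datasets` or `verifiers` installed; the sample row is fabricated for illustration, though it uses the same `op1`..`op4` / `cop` fields as CareQA:

```python
def build_prompt(question: str, options: dict[str, str]) -> str:
    # Mirrors the environment's _build_prompt.
    formatted_opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    letters = ", ".join(options.keys())
    return (
        "You are a board-certified clinician taking a medical reasoning test.\n"
        "Read the following question carefully and choose the most appropriate answer.\n\n"
        f"Question:\n{question.strip()}\n\n"
        f"Options:\n{formatted_opts}\n\n"
        f"Respond with only the option letter ({letters}), nothing else."
    )


def map_row(ex: dict) -> dict:
    # Mirrors the environment's _map: build options, convert cop to a letter.
    options = {"A": ex["op1"], "B": ex["op2"], "C": ex["op3"], "D": ex["op4"]}
    gold_letter = ["A", "B", "C", "D"][ex["cop"] - 1]  # cop is 1-indexed
    return {
        "prompt": [{"role": "user", "content": build_prompt(ex["question"], options)}],
        "answer": gold_letter,
    }


# Fabricated sample row for illustration only.
row = {
    "question": "Which vitamin deficiency causes scurvy?",
    "op1": "Vitamin A", "op2": "Vitamin B12", "op3": "Vitamin C", "op4": "Vitamin D",
    "cop": 3,
}
mapped = map_row(row)
print(mapped["answer"])  # C
```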
**pyproject.toml**

```toml
[project]
name = "careqa_mcq"
description = "Evaluation environment for the HPAI-BSC/CareQA MCQ dataset"
tags = ["healthcare", "medical-qa", "mcq", "clinical", "single-turn"]
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "verifiers>=0.1.4",
    "datasets>=2.13.0",
]

[tool.prime.environment]
loader = "careqa_mcq:load_environment"
display_name = "CareQA"
visibility = "PUBLIC"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build]
include = ["careqa_mcq.py"]
```
# careqa_openended

Evaluation environment for the [HPAI-BSC/CareQA](https://huggingface.co/datasets/HPAI-BSC/CareQA) open-ended dataset.

### Overview
- **Environment ID**: `careqa_openended`
- **Short description**: CareQA is a healthcare QA dataset with **multiple-choice** and **open-ended clinical reasoning questions**. This environment covers the open-ended questions only.
- **Tags**: healthcare, medical QA, clinical reasoning, single-turn

### Datasets
- **Primary dataset(s)**:
  - `CareQA_en_open` – open-ended clinical questions with reference answers.
- **Source links**:
  - [Hugging Face CareQA dataset](https://huggingface.co/datasets/HPAI-BSC/CareQA)

### Task
- **Type**: single-turn
- **Parser**: custom prompt mapping (no structured markup)
- **Rubric overview**:
  **Open-ended (`open_clinical`)**: `vf.JudgeRubric()` using an LLM-as-judge to score free-text answers for correctness and clinical reasoning.

### Quickstart
Run an evaluation with default settings:

```bash
uv run vf-eval careqa_openended
```

Configure model and sampling:

```bash
uv run vf-eval careqa_openended --model gpt-4.1-mini --num-examples 3 -s
```

### Metrics

| Metric | Meaning |
|---|---|
| `reward` | Main scalar reward (weighted sum of rubric criteria) |
| `judge_score` | LLM-assigned score evaluating answer quality, correctness, and clinical reasoning |
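To make the LLM-as-judge idea concrete, here is an illustrative-only sketch of how a judge prompt might be assembled. `vf.JudgeRubric` builds its own prompt internally; this template is an assumption about the general shape, not the library's actual wording:

```python
def build_judge_prompt(question: str, reference: str, response: str) -> str:
    # Hypothetical judge prompt: compare a candidate answer to the
    # dataset's reference answer and ask for a scalar score.
    return (
        "You are grading a medical QA answer.\n\n"
        f"Question:\n{question}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Candidate answer:\n{response}\n\n"
        "Score the candidate from 0.0 to 1.0 for factual correctness and "
        "clinical reasoning relative to the reference. Reply with only the score."
    )


prompt = build_judge_prompt(
    "What is the first-line treatment for anaphylaxis?",
    "Intramuscular epinephrine.",
    "Give IM epinephrine immediately.",
)
```

The judge model's reply would then be parsed into the `judge_score` metric listed above.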
**careqa_openended.py**

```python
from __future__ import annotations

from datasets import load_dataset
import verifiers as vf


# Open-ended environment

def load_environment(split: str = "test") -> vf.SingleTurnEnv:
    ds = load_dataset("HPAI-BSC/CareQA", "CareQA_en_open", split=split)

    def _map(ex):
        system_content = "You are an expert clinician answering medical questions."
        user_content = (
            "Read the following question carefully and provide a detailed, concise answer.\n\n"
            f"Question:\n{ex['question'].strip()}\n\n"
            "Answer:"
        )
        return {
            "prompt": [
                {"role": "system", "content": system_content},
                {"role": "user", "content": user_content},
            ],
            "answer": ex.get("answer_explanation", ex.get("answer", "")),
        }

    mapped = ds.map(_map, remove_columns=ds.column_names)

    rubric = vf.JudgeRubric()

    return vf.SingleTurnEnv(
        dataset=mapped,
        eval_dataset=mapped,
        rubric=rubric,
        system_prompt=None,
    )
```
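`vf.JudgeRubric` handles scoring internally, but if you ever needed to post-process a judge's free-text reply yourself, a tolerant parser might look like this (a hypothetical helper, not part of the file above):

```python
import re


def parse_judge_score(reply: str) -> float:
    """Extract the first number in the reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        # No numeric score found; treat as a zero reward.
        return 0.0
    return max(0.0, min(1.0, float(match.group())))
```

Clamping guards against judges that answer on a different scale (e.g. "7 out of 10"), and the `None` branch keeps a chatty, score-free reply from crashing the evaluation.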
**pyproject.toml**

```toml
[project]
name = "careqa_openended"
description = "Evaluation environment for the HPAI-BSC/CareQA open-ended dataset"
tags = ["healthcare", "medical-qa", "open-ended", "clinical", "single-turn"]
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "verifiers>=0.1.4",
    "datasets>=2.13.0",
]

[tool.prime.environment]
loader = "careqa_openended:load_environment"
display_name = "CareQA"
visibility = "PUBLIC"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build]
include = ["careqa_openended.py"]
```