Official Implementation of the paper:
"MVVQ-RAD: Medical Voice Vision Question-Reason Answer Dataset: A Comprehensive Multimodal Medical AI Dataset with Speech, Visual Localization, and Explainable Reasoning"
📄 Poster at The 30th TAAI Conference | 📎 Paper (PDF)
Transform VQA‑RAD into a multi‑modal, explainable medical‑QA mini‑corpus (speech ✚ bounding box ✚ reasoning)
- Implement the annotation pipeline using LangGraph
- Implement the human verification UI
- Publish the workshop paper for the pipeline (For AgentX competition)
- Run the full 300-sample pipeline
- Release the dataset reported in the paper. See section 3.5 · (Optional) Download Pre-generated Raw Data for details.
- Publish the full detailed paper
- Cooperate with medical institutions to validate the dataset
- Publish the dataset on Hugging Face
| Modality | Fields | Source models/tools |
|---|---|---|
| Image | `image` (PNG) | VQA‑RAD DICOM → PNG via `dicom2png` |
| Speech | `speech_path` (WAV) · `asr_text` | Bark (TTS) → Whisper `large-v3` (ASR) |
| Visual loc. | `visual_box` | `gemini-2.5-flash` |
| Reasoning | `text_explanation` · `uncertainty` | `gemini-2.5-flash` |
| QA flag | `needs_review` · `critic_notes` | `gemini-2.5-flash` |
Size: 300 samples covering CT/MRI/X‑ray, stratified by modality & question type. (The number may increase after discussion with medical institutions.)
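For illustration, a single record might look roughly like the following. This is a hedged sketch: the field names follow the table above, but all values are made up and the exact on-disk layout (including the `visual_box` format) may differ from `results.json`:

```json
{
  "sample_id": "vqa_rad_0001",
  "image": "images/vqa_rad_0001.png",
  "text_query": "Is there evidence of a pleural effusion?",
  "speech_path": "audio/vqa_rad_0001.wav",
  "asr_text": "Is there evidence of a pleural effusion?",
  "visual_box": [120, 84, 310, 260],
  "text_explanation": "Blunting of the left costophrenic angle suggests a small effusion.",
  "uncertainty": 0.2,
  "needs_review": false,
  "critic_notes": null
}
```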
```mermaid
flowchart TD
    START([START]) --> Loader[Loader Node<br/>Load VQA-RAD sample<br/>DICOM → PNG conversion]
    Loader --> |"image_path<br/>text_query<br/>metadata"| Segmentation[Segmentation Node<br/>Visual localization<br/>Gemini Vision bbox detection]
    Loader --> |"text_query<br/>sample_id"| ASR_TTS[ASR/TTS Node<br/>Bark TTS synthesis<br/>Whisper ASR validation]
    Segmentation --> |"visual_box"| Explanation[Explanation Node<br/>Reasoning generation<br/>Uncertainty estimation<br/>Gemini Language]
    ASR_TTS --> |"speech_path<br/>asr_text<br/>quality_score"| Explanation
    Explanation --> |"text_explanation<br/>uncertainty"| Validation[Validation Node<br/>Quality assessment<br/>Error detection<br/>Review flagging]
    Validation --> |"needs_review<br/>critic_notes<br/>quality_scores"| Pipeline_END([PIPELINE END])
    Pipeline_END -.-> |"Post-processing"| Human_UI[Human Verification UI<br/>Streamlit interface<br/>Sample review & approval<br/>Quality control]
    Human_UI --> Dataset[Final Dataset<br/>Validated samples<br/>Ready for publication]

    %% Styling
    classDef nodeStyle fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef startEnd fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px
    classDef humanProcess fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,stroke-dasharray: 5 5
    classDef dataOutput fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px

    class START,Pipeline_END startEnd
    class Loader,Segmentation,ASR_TTS,Explanation,Validation nodeStyle
    class Human_UI humanProcess
    class Dataset dataOutput
```
| Stage | Concurrency | Input | Output | Models/Tools |
|---|---|---|---|---|
| Loader | Sequential | `sample_id` | `image_path`, `text_query`, `metadata` | Hugging Face Dataset loader |
| Segmentation | Parallel | `image_path`, `text_query` | `visual_box` | Gemini 2.5 Flash |
| ASR/TTS | Parallel | `text_query`, `sample_id` | `speech_path`, `asr_text`, `speech_quality_score` | Bark TTS + Whisper-L ASR |
| Explanation | Sequential | All prior outputs | `text_explanation`, `uncertainty` | Gemini 2.5 Flash |
| Validation | Sequential | All outputs + errors | `needs_review`, `critic_notes`, `quality_scores` | Gemini 2.5 Flash |
| Human Review | Manual | Validated samples | Final dataset | Streamlit UI |
✨ Key Feature: Segmentation and ASR/TTS nodes run in parallel after the Loader, reducing total processing time by ~40%.
🔄 Each node appends versioning metadata (node_name, node_version) for full provenance tracking.
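A minimal sketch of how this fan-out/fan-in could be wired with LangGraph. The node functions and state keys below are illustrative placeholders mirroring the stage table above; see `pipeline/run_pipeline.py` for the actual graph:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class PipelineState(TypedDict, total=False):
    # Keys mirror the stage table; each node writes only its own keys.
    sample_id: str
    image_path: str
    text_query: str
    visual_box: dict
    speech_path: str
    asr_text: str
    text_explanation: str
    uncertainty: float
    needs_review: bool
    critic_notes: str


def loader_node(state: PipelineState) -> dict:
    # Placeholder: load the VQA-RAD sample and convert DICOM to PNG.
    return {"image_path": f"images/{state['sample_id']}.png", "text_query": "..."}


def segmentation_node(state: PipelineState) -> dict:
    # Placeholder: Gemini Vision bounding-box detection.
    return {"visual_box": {}}


def asr_tts_node(state: PipelineState) -> dict:
    # Placeholder: Bark TTS synthesis followed by Whisper ASR validation.
    return {"speech_path": "audio/sample.wav", "asr_text": state["text_query"]}


def explanation_node(state: PipelineState) -> dict:
    # Placeholder: reasoning generation with uncertainty estimation.
    return {"text_explanation": "...", "uncertainty": 0.5}


def validation_node(state: PipelineState) -> dict:
    # Placeholder: quality assessment and review flagging.
    return {"needs_review": False, "critic_notes": ""}


graph = StateGraph(PipelineState)
graph.add_node("loader", loader_node)
graph.add_node("segmentation", segmentation_node)
graph.add_node("asr_tts", asr_tts_node)
graph.add_node("explanation", explanation_node)
graph.add_node("validation", validation_node)

graph.add_edge(START, "loader")
graph.add_edge("loader", "segmentation")                    # fan-out: runs in parallel
graph.add_edge("loader", "asr_tts")                         # with segmentation
graph.add_edge(["segmentation", "asr_tts"], "explanation")  # fan-in: wait for both
graph.add_edge("explanation", "validation")
graph.add_edge("validation", END)

app = graph.compile()
result = app.invoke({"sample_id": "vqa_rad_0001"})
```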
Note
If you have not installed uv, please do so first:
https://docs.astral.sh/uv/getting-started/installation/
```bash
git clone https://github.com/whats2000/MedVoiceQAReasonDataset.git
cd MedVoiceQAReasonDataset
```
```bash
# Check CUDA version
nvidia-smi
# It should show something like this:
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 560.94                    Driver Version: 560.94        CUDA Version: 12.6   |
# |-----------------------------------------+------------------------+----------------------+

# Install with uv (please pick the right extra for your CUDA version)
uv sync --extra cpu
# Or if you are using CUDA 11.8
uv sync --extra cu118
# Or if you are using CUDA 12.6
uv sync --extra cu126
# Or if you are using CUDA 12.8
uv sync --extra cu128
```

Create an `.env` file with your Gemini & Hugging Face keys (see `.env.example`):
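For example, a minimal `.env` might look like this (the variable names below are only an assumption; use the exact names from `.env.example`):

```bash
# Illustrative names only - copy .env.example and fill in your actual keys
GEMINI_API_KEY=your-gemini-api-key
HF_TOKEN=your-hugging-face-access-token
```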
Then fetch the VQA‑RAD source data:

```bash
uv run .\data\huggingface_loader.py
```

If you want to skip running the full pipeline and use our pre-generated data directly, you can download it from Google Drive:
- Download the data from: Google Drive Link
- Extract the downloaded archive to the `runs/` folder
After extraction, your folder structure should look like:
```text
runs/
├── 20250528_113033-35f94652/
│   ├── manifest.json
│   ├── results.json
│   └── audio/
├── current/
│   ├── hf_data/
│   │   └── images/
│   └── images/
```
Tip
This is the same data reported in our paper. It allows you to explore the dataset, run the statistics notebook, and use the Human Verification UI without needing to run the full pipeline.
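For a quick programmatic look at a downloaded run, something like the following sketch works, assuming `results.json` holds per-sample records with the fields listed in the node I/O table below (the exact layout may differ):

```python
import json
from pathlib import Path

# Example run directory from the structure above; substitute your own run ID.
run_dir = Path("runs/20250528_113033-35f94652")

manifest = json.loads((run_dir / "manifest.json").read_text(encoding="utf-8"))
results = json.loads((run_dir / "results.json").read_text(encoding="utf-8"))

# Handle either a list of records or a dict keyed by sample_id.
records = results if isinstance(results, list) else list(results.values())

flagged = [r for r in records if r.get("needs_review")]
print(f"{len(records)} samples loaded, {len(flagged)} flagged for review")
```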
Run the test suite to verify your setup:

```bash
uv run pytest
```

Outputs land in `runs/<timestamp>-<hash>/` with a `manifest.json` for reproducibility.
```bash
# Quick run on a 50-sample subset
uv run python pipeline/run_pipeline.py --limit 50

# Full run
uv run python pipeline/run_pipeline.py
```

After processing, review the generated data through the web interface:
```bash
# Install UI dependencies
uv sync --extra ui

# Launch the verification interface
uv run medvoice-ui
```

The interface opens at http://localhost:8501, where you can:
- Review generated images, audio, and explanations
- Approve/reject samples for the final dataset
- Mark quality issues and add review notes
- Export validated dataset for publication
```text
.
├── pipeline/                    # Python graph definition (LangGraph API)
│   └── run_pipeline.py
├── nodes/                       # one folder per Node (Loader, Segmentation, …)
├── data/                        # sampling scripts & raw VQA‑RAD index
│   └── huggingface_loader.py    # data loader for VQA‑RAD
├── ui/                          # Human verification web interface
│   ├── review_interface.py      # Streamlit app for sample review
│   ├── launch.py                # UI launcher script
│   └── README.md                # UI documentation
├── runs/                        # immutable artefacts (git‑ignored)
├── tests/                       # pytest scripts
└── README.md                    # this file
```
| Node | Consumes | Produces |
|---|---|---|
| Loader | `sample_id` | `image_path`, `text_query` |
| Segmentation | `image_path`, `text_query` | `visual_box` |
| ASR / TTS | `text_query` | `speech_path`, `asr_text`, `speech_quality_score` |
| Explanation | `image_path`, `text_query`, `visual_box` | `text_explanation`, `uncertainty` |
| Validation | all prior keys | `needs_review`, `critic_notes` |
Each Node appends node_name and node_version for full provenance.
- Train or fine‑tune the new model.
- Wrap it to match the Node I/O JSON schema (a sketch follows this list).
- Edit `run_pipeline.py` to use the new version.
- Re‑run tests; if metrics pass → merge.
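A hedged sketch of the wrapping step, using the Segmentation node as an example. Function, schema, and version names here are illustrative; match them to the existing implementations under `nodes/` and to the node I/O table above:

```python
from typing import TypedDict

NODE_NAME = "segmentation"
NODE_VERSION = "2.0.0"  # bump when the underlying model changes


class SegmentationOutput(TypedDict):
    visual_box: list[float]   # produced key from the node I/O table
    node_name: str            # provenance metadata appended by every node
    node_version: str


def my_new_model_predict(image_path: str, text_query: str) -> list[float]:
    # Stub standing in for your fine-tuned model's inference call.
    return [0.0, 0.0, 1.0, 1.0]


def run_segmentation(image_path: str, text_query: str) -> SegmentationOutput:
    """Wrap the new model so it consumes and produces exactly the documented keys."""
    box = my_new_model_predict(image_path, text_query)
    return {
        "visual_box": box,
        "node_name": NODE_NAME,
        "node_version": NODE_VERSION,
    }
```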
- Code: MIT
- Derived data: CC‑BY 4.0 (VQA‑RAD is CC0 1.0; please cite their paper.)
Note
The paper is to be published at The 30th TAAI Conference. We will update the citation once the paper is officially released.
```bibtex
@inproceedings{mvvqrad2025,
  title     = {MVVQ-RAD: Medical Voice Vision Question-Reason Answer Dataset: A Comprehensive Multimodal Medical AI Dataset with Speech, Visual Localization, and Explainable Reasoning},
  author    = {Hu, Hsiang-Wei and Wang, Pei-Shan and Wu, Ren-Di and Chen, Li-Ju and Luo, Zih-Jia},
  booktitle = {Proceedings of the 30th Conference on Technologies and Applications of Artificial Intelligence (TAAI)},
  year      = {2025},
  note      = {Poster, Official Publication Pending}
}
```

- VQA‑RAD authors for the base dataset.
- Open‑source medical‑AI community for Whisper‑L, Bark, LangGraph, and Gemini credits.