Official Implementation of the paper:
"MVVQ-RAD: Medical Voice Vision Question-Reason Answer Dataset: A Comprehensive Multimodal Medical AI Dataset with Speech, Visual Localization, and Explainable Reasoning"
📄 Poster at The 30th TAAI Conference | 📎 Paper (PDF)
Transform VQA‑RAD into a multi‑modal, explainable medical‑QA mini‑corpus (speech ✚ bounding box ✚ reasoning)
- Implement the annotation pipeline using LangGraph
- Implement the human verification UI
- Publish the workshop paper for the pipeline (For AgentX competition)
- Run the full 300-sample pipeline
- Release the dataset reported in the paper. See section 3.5 · (Optional) Download Pre-generated Raw Data for details.
- Publish the full detailed paper
- Cooperate with medical institutions to validate the dataset
- Publish the dataset on Hugging Face
| Modality | Fields | Source models/tools |
|---|---|---|
| Image | `image` (PNG) | VQA‑RAD DICOM → PNG via `dicom2png` |
| Speech | `speech_path` (WAV) · `asr_text` | Bark (TTS) → Whisper `large-v3` (ASR) |
| Visual loc. | `visual_box` | `gemini-2.5-flash` |
| Reasoning | `text_explanation` · `uncertainty` | `gemini-2.5-flash` |
| QA flag | `needs_review` · `critic_notes` | `gemini-2.5-flash` |
Size: 300 samples covering CT/MRI/X‑ray, stratified by modality & question type. (The number may increase after discussion with medical institutions.)
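For illustration, a single record might look roughly like the following. This is a hedged sketch: the field names follow the table above, but all values are made up and the exact on-disk layout (including the `visual_box` format) may differ from `results.json`:

```json
{
  "sample_id": "vqa_rad_0001",
  "image": "images/vqa_rad_0001.png",
  "text_query": "Is there evidence of a pleural effusion?",
  "speech_path": "audio/vqa_rad_0001.wav",
  "asr_text": "Is there evidence of a pleural effusion?",
  "visual_box": [120, 84, 310, 260],
  "text_explanation": "Blunting of the left costophrenic angle suggests a small effusion.",
  "uncertainty": 0.2,
  "needs_review": false,
  "critic_notes": null
}
```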
```mermaid
flowchart TD
    START([START]) --> Loader[Loader Node<br/>Load VQA-RAD sample<br/>DICOM → PNG conversion]
    Loader --> |"image_path<br/>text_query<br/>metadata"| Segmentation[Segmentation Node<br/>Visual localization<br/>Gemini Vision bbox detection]
    Loader --> |"text_query<br/>sample_id"| ASR_TTS[ASR/TTS Node<br/>Bark TTS synthesis<br/>Whisper ASR validation]
    Segmentation --> |"visual_box"| Explanation[Explanation Node<br/>Reasoning generation<br/>Uncertainty estimation<br/>Gemini Language]
    ASR_TTS --> |"speech_path<br/>asr_text<br/>quality_score"| Explanation
    Explanation --> |"text_explanation<br/>uncertainty"| Validation[Validation Node<br/>Quality assessment<br/>Error detection<br/>Review flagging]
    Validation --> |"needs_review<br/>critic_notes<br/>quality_scores"| Pipeline_END([PIPELINE END])
    Pipeline_END -.-> |"Post-processing"| Human_UI[Human Verification UI<br/>Streamlit interface<br/>Sample review & approval<br/>Quality control]
    Human_UI --> Dataset[Final Dataset<br/>Validated samples<br/>Ready for publication]

    %% Styling
    classDef nodeStyle fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef startEnd fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px
    classDef humanProcess fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,stroke-dasharray: 5 5
    classDef dataOutput fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px

    class START,Pipeline_END startEnd
    class Loader,Segmentation,ASR_TTS,Explanation,Validation nodeStyle
    class Human_UI humanProcess
    class Dataset dataOutput
```
| Stage | Concurrency | Input | Output | Models/Tools |
|---|---|---|---|---|
| Loader | Sequential | `sample_id` | `image_path`, `text_query`, `metadata` | Hugging Face Dataset loader |
| Segmentation | Parallel | `image_path`, `text_query` | `visual_box` | Gemini 2.5 Flash |
| ASR/TTS | Parallel | `text_query`, `sample_id` | `speech_path`, `asr_text`, `speech_quality_score` | Bark TTS + Whisper-L ASR |
| Explanation | Sequential | All prior outputs | `text_explanation`, `uncertainty` | Gemini 2.5 Flash |
| Validation | Sequential | All outputs + errors | `needs_review`, `critic_notes`, `quality_scores` | Gemini 2.5 Flash |
| Human Review | Manual | Validated samples | Final dataset | Streamlit UI |
✨ Key Feature: Segmentation and ASR/TTS nodes run in parallel after the Loader, reducing total processing time by ~40%.
🔄 Each node appends versioning metadata (node_name, node_version) for full provenance tracking.
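A minimal sketch of how this fan-out/fan-in could be wired with LangGraph. The node functions and state keys below are illustrative placeholders mirroring the stage table above; see `pipeline/run_pipeline.py` for the actual graph:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class PipelineState(TypedDict, total=False):
    # Keys mirror the stage table; each node writes only its own keys.
    sample_id: str
    image_path: str
    text_query: str
    visual_box: dict
    speech_path: str
    asr_text: str
    text_explanation: str
    uncertainty: float
    needs_review: bool
    critic_notes: str


def loader_node(state: PipelineState) -> dict:
    # Placeholder: load the VQA-RAD sample and convert DICOM to PNG.
    return {"image_path": f"images/{state['sample_id']}.png", "text_query": "..."}


def segmentation_node(state: PipelineState) -> dict:
    # Placeholder: Gemini Vision bounding-box detection.
    return {"visual_box": {}}


def asr_tts_node(state: PipelineState) -> dict:
    # Placeholder: Bark TTS synthesis followed by Whisper ASR validation.
    return {"speech_path": "audio/sample.wav", "asr_text": state["text_query"]}


def explanation_node(state: PipelineState) -> dict:
    # Placeholder: reasoning generation with uncertainty estimation.
    return {"text_explanation": "...", "uncertainty": 0.5}


def validation_node(state: PipelineState) -> dict:
    # Placeholder: quality assessment and review flagging.
    return {"needs_review": False, "critic_notes": ""}


graph = StateGraph(PipelineState)
graph.add_node("loader", loader_node)
graph.add_node("segmentation", segmentation_node)
graph.add_node("asr_tts", asr_tts_node)
graph.add_node("explanation", explanation_node)
graph.add_node("validation", validation_node)

graph.add_edge(START, "loader")
graph.add_edge("loader", "segmentation")                    # fan-out: runs in parallel
graph.add_edge("loader", "asr_tts")                         # with segmentation
graph.add_edge(["segmentation", "asr_tts"], "explanation")  # fan-in: wait for both
graph.add_edge("explanation", "validation")
graph.add_edge("validation", END)

app = graph.compile()
result = app.invoke({"sample_id": "vqa_rad_0001"})
```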
Note
If you have not installed uv, please do so first:
https://docs.astral.sh/uv/getting-started/installation/
```bash
git clone https://github.com/whats2000/MedVoiceQAReasonDataset.git
cd MedVoiceQAReasonDataset
```
```bash
# Check CUDA version
nvidia-smi
# It should show something like this:
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 560.94                    Driver Version: 560.94        CUDA Version: 12.6   |
# |-----------------------------------------+------------------------+----------------------+

# Install with uv (please pick the right extra for your CUDA version)
uv sync --extra cpu
# Or if you are using CUDA 11.8
uv sync --extra cu118
# Or if you are using CUDA 12.6
uv sync --extra cu126
# Or if you are using CUDA 12.8
uv sync --extra cu128
```

Create an `.env` file with your Gemini & Hugging Face keys (see `.env.example`):
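For example, a minimal `.env` might look like this (the variable names below are only an assumption; use the exact names from `.env.example`):

```bash
# Illustrative names only - copy .env.example and fill in your actual keys
GEMINI_API_KEY=your-gemini-api-key
HF_TOKEN=your-hugging-face-access-token
```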
Then fetch the VQA‑RAD source data:

```bash
uv run .\data\huggingface_loader.py
```

If you want to skip running the full pipeline and use our pre-generated data directly, you can download it from Google Drive:
- Download the data from: Google Drive Link
- Extract the downloaded archive to the `runs/` folder
After extraction, your folder structure should look like:
```text
runs/
├── 20250528_113033-35f94652/
│   ├── manifest.json
│   ├── results.json
│   └── audio/
├── current/
│   ├── hf_data/
│   │   └── images/
│   └── images/
```
Tip
This is the same data reported in our paper. It allows you to explore the dataset, run the statistics notebook, and use the Human Verification UI without needing to run the full pipeline.
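For a quick programmatic look at a downloaded run, something like the following sketch works, assuming `results.json` holds per-sample records with the fields listed in the node I/O table below (the exact layout may differ):

```python
import json
from pathlib import Path

# Example run directory from the structure above; substitute your own run ID.
run_dir = Path("runs/20250528_113033-35f94652")

manifest = json.loads((run_dir / "manifest.json").read_text(encoding="utf-8"))
results = json.loads((run_dir / "results.json").read_text(encoding="utf-8"))

# Handle either a list of records or a dict keyed by sample_id.
records = results if isinstance(results, list) else list(results.values())

flagged = [r for r in records if r.get("needs_review")]
print(f"{len(records)} samples loaded, {len(flagged)} flagged for review")
```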
Run the test suite to verify your setup:

```bash
uv run pytest
```

Outputs land in `runs/<timestamp>-<hash>/` with a `manifest.json` for reproducibility.
```bash
# Quick run on a 50-sample subset
uv run python pipeline/run_pipeline.py --limit 50

# Full run
uv run python pipeline/run_pipeline.py
```

After processing, review the generated data through the web interface:
```bash
# Install UI dependencies
uv sync --extra ui

# Launch the verification interface
uv run medvoice-ui
```

The interface opens at http://localhost:8501, where you can:
- Review generated images, audio, and explanations
- Approve/reject samples for the final dataset
- Mark quality issues and add review notes
- Export validated dataset for publication
```text
.
├── pipeline/                    # Python graph definition (LangGraph API)
│   └── run_pipeline.py
├── nodes/                       # one folder per Node (Loader, Segmentation, …)
├── data/                        # sampling scripts & raw VQA‑RAD index
│   └── huggingface_loader.py    # data loader for VQA‑RAD
├── ui/                          # Human verification web interface
│   ├── review_interface.py      # Streamlit app for sample review
│   ├── launch.py                # UI launcher script
│   └── README.md                # UI documentation
├── runs/                        # immutable artefacts (git‑ignored)
├── tests/                       # pytest scripts
└── README.md                    # this file
```
| Node | Consumes | Produces |
|---|---|---|
| Loader | `sample_id` | `image_path`, `text_query` |
| Segmentation | `image_path`, `text_query` | `visual_box` |
| ASR / TTS | `text_query` | `speech_path`, `asr_text`, `speech_quality_score` |
| Explanation | `image_path`, `text_query`, `visual_box` | `text_explanation`, `uncertainty` |
| Validation | all prior keys | `needs_review`, `critic_notes` |
Each Node appends node_name and node_version for full provenance.
- Train or fine‑tune the new model.
- Wrap it to match the Node I/O JSON schema (a sketch follows this list).
- Edit `run_pipeline.py` to use the new version.
- Re‑run tests; if metrics pass → merge.
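A hedged sketch of the wrapping step, using the Segmentation node as an example. Function, schema, and version names here are illustrative; match them to the existing implementations under `nodes/` and to the node I/O table above:

```python
from typing import TypedDict

NODE_NAME = "segmentation"
NODE_VERSION = "2.0.0"  # bump when the underlying model changes


class SegmentationOutput(TypedDict):
    visual_box: list[float]   # produced key from the node I/O table
    node_name: str            # provenance metadata appended by every node
    node_version: str


def my_new_model_predict(image_path: str, text_query: str) -> list[float]:
    # Stub standing in for your fine-tuned model's inference call.
    return [0.0, 0.0, 1.0, 1.0]


def run_segmentation(image_path: str, text_query: str) -> SegmentationOutput:
    """Wrap the new model so it consumes and produces exactly the documented keys."""
    box = my_new_model_predict(image_path, text_query)
    return {
        "visual_box": box,
        "node_name": NODE_NAME,
        "node_version": NODE_VERSION,
    }
```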
- Code: MIT
- Derived data: CC‑BY 4.0 (VQA‑RAD is CC0 1.0; please cite their paper.)
Note
The paper is to be published at The 30th TAAI Conference. We will update the citation once the paper is officially released.
```bibtex
@inproceedings{mvvqrad2025,
  title     = {MVVQ-RAD: Medical Voice Vision Question-Reason Answer Dataset: A Comprehensive Multimodal Medical AI Dataset with Speech, Visual Localization, and Explainable Reasoning},
  author    = {Hu, Hsiang-Wei and Wang, Pei-Shan and Wu, Ren-Di and Chen, Li-Ju and Luo, Zih-Jia},
  booktitle = {Proceedings of the 30th Conference on Technologies and Applications of Artificial Intelligence (TAAI)},
  year      = {2025},
  note      = {Poster, Official Publication Pending}
}
```

- VQA‑RAD authors for the base dataset.
- Open‑source medical‑AI community for Whisper‑L, Bark, LangGraph, and Gemini credits.