
MVVQ-RAD: Medical Voice Vision Question-Reason Answer Dataset

Official Implementation of the paper:
"MVVQ-RAD: Medical Voice Vision Question-Reason Answer Dataset: A Comprehensive Multimodal Medical AI Dataset with Speech, Visual Localization, and Explainable Reasoning"
📄 Poster at The 30th TAAI Conference | 📎 Paper (PDF)

Transform VQA‑RAD into a multi‑modal, explainable medical‑QA mini‑corpus (speech ✚ bounding box ✚ reasoning)


📝 ToDo

  • Implement the annotation pipeline using LangGraph
  • Implement the human verification UI
  • Publish the workshop paper for the pipeline (for the AgentX competition)
  • Run the full 300-sample pipeline
  • Release the dataset reported in the paper; see section 3.5 · (Optional) Download Pre-generated Raw Data for details.
  • Publish the full detailed paper
  • Cooperate with medical institutions to validate the dataset
  • Publish the dataset on Hugging Face

⭐️ What’s inside?

| Modality | Fields | Source models / tools |
| --- | --- | --- |
| Image | image (PNG) | VQA‑RAD DICOM → PNG via dicom2png |
| Speech | speech_path (WAV) · asr_text | Bark (TTS) → Whisper large-v3 (ASR) |
| Visual loc. | visual_box | gemini-2.5-flash |
| Reasoning | text_explanation · uncertainty | gemini-2.5-flash |
| QA flag | needs_review · critic_notes | gemini-2.5-flash |

Size: 300 samples covering CT/MRI/X‑ray, stratified by modality & question type. (Number may increase after discussion with medical institutions)
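
For orientation, one record might look roughly like the Python dict below. The field names follow the table above; the concrete values, file layout, and bounding-box convention are illustrative assumptions rather than the released schema.

# Illustrative sample record; field names follow the table above,
# while the values and the bounding-box format are assumptions.
sample = {
    "image": "images/synpic54610.png",            # PNG converted from VQA-RAD DICOM
    "speech_path": "audio/synpic54610_q1.wav",    # Bark-synthesized question audio
    "asr_text": "is there evidence of a pleural effusion?",  # Whisper large-v3 transcript
    "visual_box": [120, 88, 310, 265],            # hypothetical [x_min, y_min, x_max, y_max]
    "text_explanation": "Blunting of the costophrenic angle suggests ...",
    "uncertainty": 0.2,                           # model-reported uncertainty estimate
    "needs_review": False,                        # QA flag from the validation node
    "critic_notes": "",
}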


🗺️ Pipeline (LangGraph)

flowchart TD
    START([START]) --> Loader[Loader Node<br/>Load VQA-RAD sample<br/>DICOM → PNG conversion]
    
    Loader --> |"image_path<br/>text_query<br/>metadata"| Segmentation[Segmentation Node<br/>Visual localization<br/>Gemini Vision bbox detection]
    Loader --> |"text_query<br/>sample_id"| ASR_TTS[ASR/TTS Node<br/>Bark TTS synthesis<br/>Whisper ASR validation]
    
    Segmentation --> |"visual_box"| Explanation[Explanation Node<br/>Reasoning generation<br/>Uncertainty estimation<br/>Gemini Language]
    ASR_TTS --> |"speech_path<br/>asr_text<br/>quality_score"| Explanation
    
    Explanation --> |"text_explanation<br/>uncertainty"| Validation[Validation Node<br/>Quality assessment<br/>Error detection<br/>Review flagging]
    
    Validation --> |"needs_review<br/>critic_notes<br/>quality_scores"| Pipeline_END([PIPELINE END])
    
    Pipeline_END -.-> |"Post-processing"| Human_UI[Human Verification UI<br/>Streamlit interface<br/>Sample review & approval<br/>Quality control]
    
    Human_UI --> Dataset[Final Dataset<br/>Validated samples<br/>Ready for publication]
    
    %% Styling
    classDef nodeStyle fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef startEnd fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px
    classDef humanProcess fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,stroke-dasharray: 5 5
    classDef dataOutput fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    
    class START,Pipeline_END startEnd
    class Loader,Segmentation,ASR_TTS,Explanation,Validation nodeStyle
    class Human_UI humanProcess
    class Dataset dataOutput
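For readers who prefer code to diagrams, here is a minimal LangGraph sketch of the same fan-out/fan-in wiring. The state schema and node bodies are placeholders, not the implementation in pipeline/run_pipeline.py.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class PipelineState(TypedDict, total=False):
    sample_id: str
    image_path: str
    text_query: str
    visual_box: list
    speech_path: str
    asr_text: str
    text_explanation: str
    uncertainty: float
    needs_review: bool
    critic_notes: str

# Placeholder node bodies; the real nodes call Gemini, Bark, and Whisper.
def loader(state: PipelineState) -> dict:
    return {"image_path": "runs/current/images/sample.png", "text_query": "placeholder question"}

def segmentation(state: PipelineState) -> dict:
    return {"visual_box": [0, 0, 1, 1]}

def asr_tts(state: PipelineState) -> dict:
    return {"speech_path": "audio/sample.wav", "asr_text": state.get("text_query", "")}

def explanation(state: PipelineState) -> dict:
    return {"text_explanation": "placeholder reasoning", "uncertainty": 0.5}

def validation(state: PipelineState) -> dict:
    return {"needs_review": False, "critic_notes": ""}

graph = StateGraph(PipelineState)
for name, fn in [("loader", loader), ("segmentation", segmentation),
                 ("asr_tts", asr_tts), ("explanation", explanation),
                 ("validation", validation)]:
    graph.add_node(name, fn)

graph.add_edge(START, "loader")
# Fan out: both branches run in parallel after the loader.
graph.add_edge("loader", "segmentation")
graph.add_edge("loader", "asr_tts")
# Fan in: explanation waits for both branches to finish.
graph.add_edge(["segmentation", "asr_tts"], "explanation")
graph.add_edge("explanation", "validation")
graph.add_edge("validation", END)

app = graph.compile()
# Usage: app.invoke({"sample_id": "synpic12345"}) walks the graph in the order shown above.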

📊 Processing Details

| Stage | Concurrency | Input | Output | Models / Tools |
| --- | --- | --- | --- | --- |
| Loader | Sequential | sample_id | image_path, text_query, metadata | Hugging Face Dataset loader |
| Segmentation | Parallel | image_path, text_query | visual_box | Gemini 2.5 Flash |
| ASR/TTS | Parallel | text_query, sample_id | speech_path, asr_text, speech_quality_score | Bark TTS + Whisper-L ASR |
| Explanation | Sequential | All prior outputs | text_explanation, uncertainty | Gemini 2.5 Flash |
| Validation | Sequential | All outputs + errors | needs_review, critic_notes, quality_scores | Gemini 2.5 Flash |
| Human Review | Manual | Validated samples | Final dataset | Streamlit UI |

Key Feature: Segmentation and ASR/TTS nodes run in parallel after the Loader, reducing total processing time by ~40%.

🔄 Each node appends versioning metadata (node_name, node_version) for full provenance tracking.
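
A minimal sketch of how that per-node versioning could be attached is shown below; the decorator approach and the provenance key are assumptions for illustration, not the repo's actual mechanism.

import functools

def with_provenance(node_name: str, node_version: str):
    """Wrap a node so every update it emits also records which node produced it."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(state: dict) -> dict:
            update = fn(state) or {}
            history = list(state.get("provenance", []))  # key name is an assumption
            history.append({"node_name": node_name, "node_version": node_version})
            update["provenance"] = history
            return update
        return wrapper
    return decorator

# Example (names are illustrative):
# segmentation = with_provenance("segmentation", "v1.2.0")(segmentation)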


🚀 Quick Start

1 · Clone & install with uv

Note

If you have not installed uv, please do so first: https://docs.astral.sh/uv/getting-started/installation/

git clone https://github.com/whats2000/MedVoiceQAReasonDataset.git
cd MedVoiceQAReasonDataset

# Check CUDA version
nvidia-smi
# It should show something like this:
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 560.94                 Driver Version: 560.94         CUDA Version: 12.6     |
# |-----------------------------------------+------------------------+----------------------+

# Install with uv (Please pick the right one for your CUDA version)
uv sync --extra cpu
# Or, if you are using CUDA 11.8
uv sync --extra cu118
# Or, if you are using CUDA 12.6
uv sync --extra cu126
# Or, if you are using CUDA 12.8
uv sync --extra cu128

2 · Prepare secrets

Create a .env file with your Gemini and Hugging Face keys (see .env.example).
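
A minimal .env might look like the following; the variable names here are illustrative, so copy the exact names from .env.example.

# Illustrative only – use the exact variable names from .env.example
GOOGLE_API_KEY=your-gemini-api-key
HF_TOKEN=your-hugging-face-access-token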

3 · Download VQA‑RAD index

uv run data/huggingface_loader.py
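
If you just want to peek at the source data, the snippet below shows the general idea using the datasets library; the dataset ID is a commonly used community mirror of VQA‑RAD and may differ from what data/huggingface_loader.py actually fetches.

from datasets import load_dataset

# "flaviagiammarino/vqa-rad" is a widely used VQA-RAD mirror on the Hub;
# the loader script in this repo may use a different source or extra filtering.
vqa_rad = load_dataset("flaviagiammarino/vqa-rad")
print(vqa_rad["train"][0]["question"], "->", vqa_rad["train"][0]["answer"])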

3.5 · (Optional) Download Pre-generated Raw Data

If you want to skip running the full pipeline and use our pre-generated data directly, you can download it from Google Drive:

  1. Download the data from: Google Drive Link
  2. Extract the downloaded archive to the runs/ folder

After extraction, your folder structure should look like:

runs/
├── 20250528_113033-35f94652/
│   ├── manifest.json
│   ├── results.json
│   └── audio/
├── current/
│   ├── hf_data/
│   │   └── images/
│   └── images/

Tip

This is the same data reported in our paper. It allows you to explore the dataset, run the statistics notebook, and use the Human Verification UI without needing to run the full pipeline.

4 · Verify installation

uv run pytest

Outputs land in runs/<timestamp>-<hash>/ with manifest.json for reproducibility.
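
To inspect a finished run programmatically, something like the following sketch works; the internal structure of manifest.json and results.json is not documented here, so treat the printed fields as assumptions.

import json
from pathlib import Path

# Find run directories that contain a manifest (runs/<timestamp>-<hash>/).
run_dirs = [p for p in Path("runs").iterdir() if (p / "manifest.json").exists()]
latest = max(run_dirs, key=lambda p: p.stat().st_mtime)

manifest = json.loads((latest / "manifest.json").read_text(encoding="utf-8"))
results = json.loads((latest / "results.json").read_text(encoding="utf-8"))

print("Latest run:", latest.name)
print("Manifest keys:", sorted(manifest))          # assumes manifest.json is a JSON object
print("Results entries:", type(results).__name__)  # schema not assumed here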

5 · Dry‑run on 50 samples

uv run python pipeline/run_pipeline.py --limit 50

6 · Full 300‑sample run

uv run python pipeline/run_pipeline.py

7 · Human verification via UI

After processing, review the generated data through the web interface:

# Install UI dependencies
uv sync --extra ui

# Launch the verification interface
uv run medvoice-ui

The interface opens at http://localhost:8501 where you can:

  • Review generated images, audio, and explanations
  • Approve/reject samples for the final dataset
  • Mark quality issues and add review notes
  • Export validated dataset for publication

🏗️ Repo layout

.
├── pipeline/          # Python graph definition (LangGraph API)
│   └── run_pipeline.py
├── nodes/                    # one folder per Node (Loader, Segmentation, …)
├── data/                     # sampling scripts & raw VQA‑RAD index
│   └── huggingface_loader.py # data loader for VQA‑RAD
├── ui/                       # Human verification web interface
│   ├── review_interface.py   # Streamlit app for sample review
│   ├── launch.py            # UI launcher script
│   └── README.md            # UI documentation
├── runs/                     # immutable artefacts  (git‑ignored)
├── tests/                    # pytest script
└── README.md                 # this file

📝 Node Contracts

| Node | Consumes | Produces |
| --- | --- | --- |
| Loader | sample_id | image_path, text_query |
| Segmentation | image_path, text_query | visual_box |
| ASR / TTS | text_query | speech_path, asr_text, speech_quality_score |
| Explanation | image_path, text_query, visual_box | text_explanation, uncertainty |
| Validation | all prior keys | needs_review, critic_notes |

Each Node appends node_name and node_version for full provenance.
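
One lightweight way to keep node implementations honest against this table is to check their outputs against the declared keys; the contract dict below simply restates the table and is not code from the repo.

# Contract table restated as data (Validation consumes "all prior keys", so no set is listed).
NODE_CONTRACTS = {
    "loader": {"consumes": {"sample_id"}, "produces": {"image_path", "text_query"}},
    "segmentation": {"consumes": {"image_path", "text_query"}, "produces": {"visual_box"}},
    "asr_tts": {"consumes": {"text_query"},
                "produces": {"speech_path", "asr_text", "speech_quality_score"}},
    "explanation": {"consumes": {"image_path", "text_query", "visual_box"},
                    "produces": {"text_explanation", "uncertainty"}},
    "validation": {"consumes": None,
                   "produces": {"needs_review", "critic_notes"}},
}

def check_output(node: str, update: dict) -> None:
    """Raise if a node's update is missing any key it promises to produce."""
    missing = NODE_CONTRACTS[node]["produces"] - set(update)
    if missing:
        raise ValueError(f"{node} output missing keys: {sorted(missing)}")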


🔄 Update Models in Four Steps

  1. Train or fine‑tune the new model.
  2. Wrap it to match the Node I/O JSON schema.
  3. Edit run_pipeline.py to use the new version.
  4. Re‑run tests; if metrics pass → merge.
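
For step 2, wrapping a new model usually amounts to keeping the node's consumes/produces keys stable. The sketch below swaps a hypothetical detector into the Segmentation node; the helper function is a stand-in, not a real API.

def my_new_detector(image_path: str, text_query: str) -> list:
    """Stand-in for a fine-tuned localization model (hypothetical)."""
    return [0, 0, 1, 1]

def segmentation_v2(state: dict) -> dict:
    """Drop-in replacement for the Segmentation node: same consumes/produces keys."""
    box = my_new_detector(state["image_path"], state["text_query"])
    return {"visual_box": box}  # must match the visual_box format expected downstream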

📜 License & Citation

  • Code: MIT
  • Derived data: CC‑BY 4.0 (VQA‑RAD is CC0 1.0; please cite their paper.)

Note

The paper is to be published at The 30th TAAI Conference. We will update the citation once the paper is officially released.

@inproceedings{mvvqrad2025,
  title     = {MVVQ-RAD: Medical Voice Vision Question-Reason Answer Dataset: A Comprehensive Multimodal Medical AI Dataset with Speech, Visual Localization, and Explainable Reasoning},
  author    = {Hu, Hsiang-Wei and Wang, Pei-Shan and Wu, Ren-Di and Chen, Li-Ju and Luo, Zih-Jia},
  booktitle = {Proceedings of the 30th Conference on Technologies and Applications of Artificial Intelligence (TAAI)},
  year      = {2025},
  note      = {Poster, Official Publication Pending}
}

✨ Acknowledgements

  • VQA‑RAD authors for the base dataset.
  • Open‑source medical‑AI community for Whisper‑L, Bark, LangGraph, and Gemini credits.
