awesome-machine-learning-for-healthcare

Welcome to my personal repository, a curated collection of research at the intersection of machine learning and healthcare. As an AI researcher with a strong interest in healthcare applications, I compiled this list to showcase work mostly in natural language processing (NLP) and multimodal learning within the healthcare domain. Although the collection reflects my personal research focus, it aims to be a useful resource for anyone interested in applying machine learning to healthcare. Contributions and discussions are welcome, so feel free to share ideas or suggest papers!

Large Language Models

(2023/11) Meditron-70b: Scaling medical pretraining for large language models [paper]
(2024/04) Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare [paper]
(2024/04) Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks [paper]
(2024/01) Health-LLM: Large language models for health prediction via wearable sensor data [paper]
(2022/03) MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering [paper]
(2023/07) Med-HALT: Medical Domain Hallucination Test for Large Language Models [paper]
(2024/01) K-QA: A Real-World Medical Q&A Benchmark [paper]
(2024/05) MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain [paper]

Medical Agent

Synthetic Data Generation

(2017/03) Generating Multi-label Discrete Patient Records using Generative Adversarial Networks [paper]
(2010/10) Data-driven approach for creating synthetic electronic medical records [paper]
(2023/03) EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models [paper]
(2023/04) Synthesize High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model [paper]
(2023/08) EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records [paper]
LLMSYN: Generating Synthetic Electronic Health Records Without Patient-Level Data [paper]

Data Representation and Predictive Modeling

(2022/07) GenHPF: General Healthcare Predictive Framework with Multi-task Multi-source Learning [paper]
(2024/02) REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models [paper]
(2024/06) EMERGE: Integrating RAG for Improved Multimodal EHR Predictive Modeling [paper]
(2024/07) EHRmonize: A Framework for Medical Concept Abstraction from Electronic Health Records using Large Language Models [paper]

Multimodal Representation Learning

(2022/07) MedFuse: Multi-modal fusion with clinical time-series data and chest X-ray images [paper]
(2023/05) Learning Missing Modal Electronic Health Records with Unified Multi-modal Data Embedding and Modality-Aware Attention [paper]
(2024/06) From Basic to Extra Features: Hypergraph Transformer Pretrain-then-Finetuning for Balanced Clinical Predictions on EHR [paper]
(2024/06) FlexCare: Leveraging Cross-Task Synergy for Flexible Multimodal Healthcare Prediction [paper]
(2024/07) MEDFuse: Multimodal EHR Data Fusion with Masked Lab-Test Modeling and Large Language Models [paper]
Multimodal Patient Representation Learning with Missing Modalities and Labels [paper]

Toward a Natural Language Interface for EHRs

Fact Checking

(2020/10) Explainable Automated Fact-Checking for Public Health Claims (EMNLP 2020) [paper] [code]
(2021) Evidence-based Fact-Checking of Health-related Claims (Findings of EMNLP 2021) [paper] [code]
(2024) HealthFC: Verifying Health Claims with Evidence-Based Medical Fact-Checking (LREC-COLING 2024) [paper] [code]
(2024) DOSSIER: Fact Checking in Electronic Health Records while Preserving Patient Privacy (MLHC 2024) [paper] [code]
(2024/06) EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records (NeurIPS 2024) [arxiv] [code] [physionet]
(2024/11) FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models (CVPR 2025) [arxiv] [code]
(2025/01) VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records (Preprint) [arxiv] [code] [physionet]

Medical Imaging

Medical Imaging Datasets

(2019/12) MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports [paper]
(2019/01) MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs [paper]
(2023/10) Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge [paper]
(2024/03) A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities [paper]
(2024/04) RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [paper]
(2024/06) Shadow and Light: Digitally Reconstructed Radiographs for Disease Classification [paper]
(2024/08) MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine [paper]

Radiology Report Generation

Multimodal Large Language Models (MLLMs)

(2024/09) MediConfusion: Can You Trust Your AI Radiologist? Probing the Reliability of Multimodal Medical Foundation Models (ICLR 2025) [arxiv]
(2024/10) MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models (ICLR 2025) [arxiv]
(2025/04) GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning (Preprint) [arxiv]
(2025/04) Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence (Preprint) [arxiv]
