This GitHub repository contains all the datasets and code for the paper "Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes".
All the publicly available datasets used for evaluation are in the dataset folder, including all the synthetic clinical notes derived from Phenopacket-Store and the selected cohort of 5,980 clinical notes used for testing. Additionally, for this paper we include pubmed_free_text, a set of 255 literature-derived clinical notes originally compiled in LLM-Gene-Prioritization.
We synthesized the Phenopacket-derived clinical notes with ChatGPT using a specific prompting strategy. If you use these notes in your studies, please cite both the Phenopacket paper and our paper:
Wu D, et al. Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes. arXiv, arXiv:2503.12286 [cs.CL], 2025