Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

PennShenLab/FREEFORM

FREEFORM: Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling

This repository holds the official code for the paper Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models.

🎯 Abstract

Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data, using a novel knowledge-driven framework. We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find that this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub.

📝 Requirements

The algorithm is implemented in Python. To create the conda environment and install the required packages, run:

conda env create -f environment.yml
conda activate freeform

🔨 Usage

To use our framework, see the demonstration.ipynb notebook, which walks through an example pipeline built from the functions defined in utils.py. To replicate our results, refer to the notebooks with "evaluation" in the filename.
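The abstract names ensembling as one of the framework's design principles. The sketch below is a generic illustration of that idea only: it is not taken from utils.py, and the function name `ensemble_select`, the feature identifiers, and the majority-vote rule are all assumptions made for the example. It combines the feature lists returned by several independent selection runs and keeps the features chosen most often.

```python
from collections import Counter
from typing import Dict, Iterable, List


def ensemble_select(runs: Iterable[List[str]], top_k: int) -> List[str]:
    """Return the top_k features chosen most often across runs.

    Hypothetical helper, not part of this repository. Ties are broken by
    the order in which features were first seen, so the result is
    deterministic.
    """
    votes: Counter = Counter()
    first_seen: Dict[str, int] = {}
    for run in runs:
        # dict.fromkeys removes duplicates within one run, keeping order,
        # so each run contributes at most one vote per feature.
        for feature in dict.fromkeys(run):
            votes[feature] += 1
            first_seen.setdefault(feature, len(first_seen))
    ranked = sorted(votes, key=lambda f: (-votes[f], first_seen[f]))
    return ranked[:top_k]


# Placeholder outputs of three independent selection runs (made-up IDs).
runs = [
    ["snp_a", "snp_b", "snp_c"],
    ["snp_b", "snp_d", "snp_a"],
    ["snp_b", "snp_c", "snp_e"],
]
selected = ensemble_select(runs, top_k=3)
print(selected)
```

For the actual pipeline and function names, demonstration.ipynb remains the reference.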

🤝 Acknowledgements

This work was supported in part by the NIH grants U01 AG066833, U01 AG068057, R01 AG071470, U19 AG074879, and S10 OD023495.

📭 Maintainers

📚 Citation

@article{FreeForm,
      title={Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models}, 
      author={Joseph Lee and Shu Yang and Jae Young Baik and Xiaoxi Liu and Zhen Tan and Dawei Li and Zixuan Wen and Bojian Hou and Duy Duong-Tran and Tianlong Chen and Li Shen},
      year={2024},
      journal={arXiv preprint arXiv:2410.01795},
}
