ChemFM: A Foundation Model for Chemical Design and Property Prediction


ArXiv | Hugging Face | Discord

Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. Contact
  7. Citation
  8. Acknowledgments
  9. License

About The Project

ChemFM is a large-scale foundation model specifically designed for chemistry. It has been pre-trained on 178 million molecules from UniChem using self-supervised causal language modeling, enabling the extraction of versatile and generalizable molecular representations.

The model comes in two variants, with approximately 1 billion and 3 billion trainable parameters:

[Figure: Pre-training overview]

The model can be fine-tuned for a wide range of downstream chemical tasks, such as chemical design and property prediction:

[Figure: Downstream fine-tuning tasks]

(back to top)

Getting Started

ChemFM has been tested with Python 3.10 and PyTorch 2.3.0. You can set up the required environment with Conda as follows:

  1. Clone the repository:

```bash
git clone https://github.com/TheLuoFengLab/ChemFM.git
cd ChemFM
```

  2. Create and activate the Conda environment:

```bash
conda env create -f environment.yml
conda activate ChemFM
```
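
After activation, a quick sanity check (a minimal sketch, assuming the environment pins the versions above) confirms that the core dependencies are importable:

```python
# Check that PyTorch and Transformers resolved correctly in the new environment.
import torch
import transformers

print(f"PyTorch {torch.__version__}")             # tested with 2.3.0
print(f"Transformers {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```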

(back to top)

Usage

Quick Start

To get started with ChemFM, you can load the ChemFM models directly from Hugging Face using the following Python script:

```python
from transformers import AutoModel, AutoTokenizer

# Load the ChemFM-3B model and tokenizer
model_name = "ChemFM/ChemFM-3B"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
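
As a quick sketch of what the loaded model provides, the snippet below encodes a SMILES string and mean-pools the hidden states into a single molecular embedding. The pooling choice is an illustrative assumption, not a recommendation from the ChemFM paper:

```python
import torch

# Aspirin, written as a SMILES string.
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
inputs = tokenizer(smiles, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state has shape (batch, seq_len, hidden_dim);
# mean-pooling over tokens is one simple way to get a fixed-size embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```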

Pre-training the Model

Pre-training requires significant time and high-performance GPUs due to the scale of both the model and the dataset. For instance, ChemFM-3B took over 20 days on 16 H100 GPUs. For detailed instructions on how to pre-train ChemFM, please refer to the pretraining subfolder.
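
For orientation only, here is a minimal sketch of the self-supervised causal language modeling objective using the Hugging Face Trainer. The corpus path and hyperparameters are placeholder assumptions; the code in the pretraining subfolder is the authoritative pipeline:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "ChemFM/ChemFM-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # causal LMs often ship without a pad token

# Placeholder corpus: one SMILES string per line (ChemFM itself used UniChem).
dataset = load_dataset("text", data_files={"train": "smiles_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

model = AutoModelForCausalLM.from_pretrained(model_name)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chemfm-pretrain", per_device_train_batch_size=8),
    train_dataset=tokenized,
    # mlm=False selects the causal (next-token) objective used for pre-training.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```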

Fine-tuning the Model

Fine-tuning can typically be performed on a single machine with a moderate GPU. For detailed instructions on how to fine-tune ChemFM for specific tasks, please refer to the relevant task subfolders in this repository; an illustrative sketch follows below.
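
As a sketch only (not the project's actual fine-tuning pipeline), a regression head for property prediction can be attached with standard transformers tooling. The CSV file and its "smiles"/"label" columns are hypothetical:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "ChemFM/ChemFM-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Single-output regression head, e.g. for a solubility-style property.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type="regression"
)
model.config.pad_token_id = tokenizer.pad_token_id

# Hypothetical dataset: a CSV with "smiles" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv"})["train"]

def preprocess(batch):
    return tokenizer(batch["smiles"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chemfm-property", per_device_train_batch_size=16),
    train_dataset=tokenized,
)
trainer.train()
```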

(back to top)

Roadmap

This GitHub project is still under active development.

If you'd like to request additional features, please submit a feature request in the GitHub Issues section, or feel free to contact us.

(back to top)

Contributing

Any contributions you make are greatly appreciated. They can include, but are not limited to:

  • New dataset evaluations for existing tasks.
  • Extensions to new task domains in chemistry.

If you have suggestions for improvement, feel free to fork the repository and submit a pull request. You can also open an issue with the "enhancement" tag.

(back to top)

Contact

Main Developer: Feiyang Cai - [email protected]
Project Supervisor: Feng Luo - [email protected]

Join our community on Discord to stay updated or ask questions.

(back to top)

Citation

If you find our work valuable, please consider giving the project a star and citing it in your research:

```bibtex
@article{ChemFM,
  title   = {A Foundation Model for Chemical Design and Property Prediction},
  author  = {Feiyang Cai and Tianyu Zhu and Tzuen-Rong Tzeng and Yongping Duan and Ling Liu and Srikanth Pilla and Gang Li and Feng Luo},
  journal = {arXiv preprint arXiv:2410.21422},
  year    = {2024},
}
```

Thank you for your support!

(back to top)

Acknowledgments

The pre-training of ChemFM is based on TinyLlama, and the fine-tuning process is supported by Hugging Face.

We would also like to thank Clemson University's Palmetto Cluster team for their invaluable support with cloud computing resources and maintenance.

(back to top)

License

This project is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. For more details, please see the LICENSE file.

(back to top)