Molecular Property Prediction

This section of the repository contains all necessary scripts and details for fine-tuning ChemFM for Molecular Property Prediction tasks.

We provide Hugging Face demos where you can run property prediction on the benchmark datasets.

Overview

Molecular property prediction is one of the key tasks in computational chemistry, where the goal is to predict various properties (e.g., solubility, bioactivity, toxicity) of a molecule based on its SMILES representation.

  • We include comparisons for the benchmark datasets (MoleculeNet and ADMET), along with details to replicate the results reported in the paper. Model checkpoints and configurations are also provided for each dataset.

  • We also provide code for fine-tuning on custom datasets, allowing flexibility for various tasks.

Steps to Fine-tune ChemFM

1. Prepare the Dataset

1.1 MoleculeNet

For MoleculeNet datasets, we use ChemBench, which extracts the exact same datasets (including splitting methods, random seed, and number of folds) as described in the MoleculeNet paper.

To install ChemBench, use the following commands:

git clone https://github.com/shenwanxiang/ChemBench.git
cd ChemBench
pip install -e .
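
If you want to inspect one of the extracted datasets yourself, loading it with its predefined folds might look like the sketch below. The load_data helper and its return values are assumptions about ChemBench's interface; check the ChemBench README for the exact API.

# A minimal sketch, assuming ChemBench exposes a load_data helper that returns
# a dataset together with the predefined fold indices used by MoleculeNet.
# The function name and return values are assumptions; see the ChemBench README.
from chembench import load_data

df, fold_indices = load_data('ESOL')   # SMILES strings and labels, plus split indices
print(df.head())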

1.2 ADMET

For the ADMET benchmark, we use the TDC (Therapeutics Data Commons) library, which is included in the environment.yml file provided on the main page of the repository. Alternatively, you can install it with the following command:

conda install -c conda-forge pytdc
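
As a quick sanity check that TDC is installed, the sketch below loads one ADMET dataset through TDC's benchmark group interface, which also provides the benchmark's own splits (Caco2_Wang is just one example dataset name from the results table further down):

# A minimal sketch that fetches one ADMET dataset via TDC's benchmark group,
# including the benchmark's train/validation/test splits.
from tdc.benchmark_group import admet_group

group = admet_group(path='data/')                 # downloads and caches the benchmark data
benchmark = group.get('Caco2_Wang')               # one dataset from the ADMET group
train_val, test = benchmark['train_val'], benchmark['test']
train, valid = group.get_train_valid_split(benchmark=benchmark['name'],
                                           split_type='default', seed=1)
print(train.columns)                              # Drug_ID, Drug (SMILES), Y (label)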

1.3 Custom Dataset

To fine-tune ChemFM on a custom dataset, you need to prepare your dataset in a folder with three CSV files: train.csv, val.csv, and test.csv. Each CSV file should include:

  • A "smiles" column for the molecular SMILES strings.
  • One or more label columns for the target property values. For reference, example files are provided in the custom_data_example folder.

You can refer to the Supported Features section to learn more about the types of tasks we support.
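
As a concrete illustration, the sketch below writes a tiny single-label dataset in the expected layout; the molecules, property values, and folder name are made up, so treat the actual files in the custom_data_example folder as the reference format.

# A minimal sketch that writes a toy custom dataset: one folder containing
# train.csv, val.csv, and test.csv, each with a "smiles" column and one label
# column. All values and paths here are made up for illustration.
import os
import pandas as pd

os.makedirs('my_custom_dataset', exist_ok=True)
toy = pd.DataFrame({
    'smiles': ['CCO', 'c1ccccc1', 'CC(=O)Oc1ccccc1C(=O)O'],  # ethanol, benzene, aspirin
    'solubility': [1.10, -2.13, -1.72],                      # hypothetical regression labels
})
for split in ['train', 'val', 'test']:
    toy.to_csv(f'my_custom_dataset/{split}.csv', index=False)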

(back to top)

2. Configure the Parameters for Training

You can configure the parameters for training in two ways:

  • Feed arguments directly to the Python file: Pass the arguments as command-line parameters when running the training script.
  • Specify the parameters in a YAML file: Define all configurations in a .yml file and pass the file path to the Python script.

We provide an example YAML file along with explanations of the configuration options.

For the MoleculeNet and ADMET benchmark datasets, you can directly use the configuration files stored in configs/admet and configs/moleculenet.

(back to top)

3. Fine-tuning Script

To fine-tune ChemFM, you can use the following command:

python -m accelerate.commands.launch --config_file accelerate_config.yaml main.py --training_args_file <config_yml_file>

Our code is based on the accelerate package, and the accelerate_config.yaml file is used to configure the distribution settings for training across multiple devices.
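
If you need to adapt accelerate_config.yaml to your own hardware (number of GPUs, mixed precision, and so on), the accelerate CLI can regenerate it interactively, for example:

accelerate config --config_file accelerate_config.yaml

The accelerate launch command is equivalent to the python -m accelerate.commands.launch form shown above, so you can also run: accelerate launch --config_file accelerate_config.yaml main.py --training_args_file <config_yml_file>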

(back to top)

Benchmark Results

The evaluation of molecular property prediction models can be quite noisy across the community: we have observed several papers and methods that rely on certain "tricks" to improve their reported performance. Our evaluations, in contrast, follow the strict standards below, which are common practice in machine learning but are often ignored by some methods:

  • Exact dataset usage: We use the exact same datasets (including the splitting methods, random seeds, and number of folds). This ensures that the training, validation, and test datasets are consistent across evaluations. Different data splits can lead to significant differences in evaluation results, especially when using scaffold splitting.

  • Training only on the training set: We strictly use only the training data for training the model. Including validation data in the training set can considerably boost performance, especially for datasets with fewer than 1000 data samples.

  • Hyperparameter tuning on the validation set: We tune hyperparameters only on the validation set, never on the test set. Public test sets are often leaked, and we have observed cases where hyperparameters are optimized directly on test performance, which does not constitute a fair evaluation. For example, when working with the CYP2C9_Substrate_CarbonMangels dataset in the ADMET benchmark, we select hyperparameters and report performance based on the validation metric (the dot in the green circle) rather than the test metric (the dot in the red circle), as illustrated in the figure and sketch below.

[Figure: hyperparameter selection for CYP2C9_Substrate_CarbonMangels, with the selected validation point circled in green and the test point circled in red]
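
In code, this selection rule is simply: pick the hyperparameter setting with the best validation metric and report its test metric, never the other way around. The sketch below illustrates this with made-up numbers:

# A minimal sketch of the selection rule described above: hyperparameters are
# chosen by the validation metric only, and the corresponding test metric is
# what gets reported. All numbers are made up for illustration.
runs = [
    # (hyperparameter setting, validation PRC-AUC, test PRC-AUC)
    ({'lr': 1e-4, 'epochs': 20}, 0.41, 0.39),
    ({'lr': 3e-4, 'epochs': 10}, 0.44, 0.40),
    ({'lr': 1e-3, 'epochs': 30}, 0.38, 0.45),  # best on test, but must not be selected
]

best = max(runs, key=lambda r: r[1])   # select by the validation metric only
print('selected:', best[0], 'reported test metric:', best[2])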

(back to top)

MoleculeNet Results

As previously mentioned, different data splits can lead to significant variations in evaluation results, particularly when using scaffold splitting.

This comparison is based on the exact same data splits from the original MoleculeNet benchmark, as extracted by ChemBench.

We also compared our results with other methods using different data splits on the MoleculeNet benchmark in our paper.

Important

We exclude methods that are not open-sourced, that do not adhere to the standard training and evaluation rules discussed above, or that we could not replicate in our extensive replication experiments.

Click to expand the MoleculeNet results
| Category | Dataset | Task Metric | MoleculeNet (Model) | Chemprop | MMNB | ChemFM-3B |
|---|---|---|---|---|---|---|
| Pharmacokinetic | BBBP | ROC-AUC ↑ | 0.690 (Weave) | 0.738 | 0.739 | 0.751 |
| Bioactivity | BACE | ROC-AUC ↑ | 0.806 (Weave) | - | 0.835 | 0.869 |
| Bioactivity | HIV | ROC-AUC ↑ | 0.763 (GC) | 0.776 | 0.777 | 0.807 |
| Bioactivity | MUV | PRC-AUC ↑ | 0.109 (Weave) | 0.041 | 0.096 | 0.135 |
| Bioactivity | PCBA | PRC-AUC ↑ | 0.136 (GC) | 0.335 | 0.276 | 0.346 |
| Toxicity | Tox21 | ROC-AUC ↑ | 0.829 (GC) | 0.851 | 0.845 | 0.869 |
| Toxicity | SIDER | ROC-AUC ↑ | 0.638 (GC) | 0.676 | 0.680 | 0.709 |
| Toxicity | ClinTox | ROC-AUC ↑ | 0.832 (Weave) | 0.864 | 0.888 | 0.918 |
| Physicochemical | ESOL | RMSE ↓ | 0.580 (MPNN) | 0.555 | 0.575 | 0.516 |
| Physicochemical | FreeSolv | RMSE ↓ | 1.150 (MPNN) | 1.075 | 1.155 | 0.830 |
| Physicochemical | Lipophilicity | RMSE ↓ | 0.655 (GC) | 0.555 | 0.625 | 0.545 |
| Molecular Binding | PDBbind-Full | RMSE ↓ | 1.440 (GC) | 1.391 | 0.721 | 0.697 |

(back to top)

ADMET Results

If you want to evaluate or submit the results, you can download the trained model checkpoints for each dataset. Then, simply run the following command:

python ./submit_admet.py --model_path <path to the checkpoints> --dataset <dataset_name> --task_type <regression or classification>
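
For example, a call for the Caco2_Wang dataset (a regression task) might look like the line below; the checkpoint path is a placeholder for wherever you stored the downloaded weights, and the dataset name should be adjusted to the dataset you trained on:

python ./submit_admet.py --model_path ./checkpoints/Caco2_Wang --dataset Caco2_Wang --task_type regression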

Important

We carefully reviewed the published code and excluded certain methods from comparison in the ADMET benchmark because they did not follow the evaluation rules outlined above. The reasons for excluding these methods are detailed in Table S2.5 of the paper.

Click to expand the ADMET results
| Category | Dataset | Task Metric | Previous Best | ChemFM |
|---|---|---|---|---|
| Absorption | Caco2_Wang | MAE ↓ | 0.330 ± 0.024 @Chemprop-RDKit | 0.322 ± 0.026 |
| Absorption | Bioavailability_Ma | ROC-AUC ↑ | 0.672 ± 0.021 @DeepPurpose | 0.715 ± 0.011 |
| Absorption | Lipophilicity_AstraZeneca | MAE ↓ | 0.467 ± 0.006 @Chemprop-RDKit | 0.460 ± 0.006 |
| Absorption | Solubility_AqSolDB | MAE ↓ | 0.761 ± 0.025 @Chemprop-RDKit | 0.725 ± 0.011 |
| Absorption | HIA_Hou | ROC-AUC ↑ | 0.981 ± 0.002 @Chemprop-RDKit | 0.984 ± 0.004 |
| Absorption | Pgp_Broccatelli | ROC-AUC ↑ | 0.929 ± 0.006 @AttrMasking | 0.931 ± 0.003 |
| Distribution | BBB_Martins | ROC-AUC ↑ | 0.897 ± 0.004 @ContextPred | 0.908 ± 0.010 |
| Distribution | PPBR_AZ | MAE ↓ | 7.788 ± 0.210 @Chemprop | 7.505 ± 0.073 |
| Distribution | VDss_Lombardo | Spearman ↑ | 0.561 ± 0.025 @DeepPurpose | 0.662 ± 0.013 |
| Metabolism | CYP2C9_Veith | PRC-AUC ↑ | 0.777 ± 0.003 @Chemprop-RDKit | 0.788 ± 0.005 |
| Metabolism | CYP2D6_Veith | PRC-AUC ↑ | 0.673 ± 0.007 @Chemprop-RDKit | 0.704 ± 0.003 |
| Metabolism | CYP3A4_Veith | PRC-AUC ↑ | 0.876 ± 0.003 @Chemprop-RDKit | 0.878 ± 0.003 |
| Metabolism | CYP2C9_Substrate_CarbonMangels | PRC-AUC ↑ | 0.400 ± 0.008 @Chemprop-RDKit | 0.414 ± 0.027 |
| Metabolism | CYP2D6_Substrate_CarbonMangels | PRC-AUC ↑ | 0.686 ± 0.031 @Chemprop-RDKit | 0.739 ± 0.024 |
| Metabolism | CYP3A4_Substrate_CarbonMangels | ROC-AUC ↑ | 0.619 ± 0.030 @Chemprop-RDKit | 0.654 ± 0.022 |
| Excretion | Half_Life_Obach | Spearman ↑ | 0.329 ± 0.083 @DeepPurpose | 0.551 ± 0.020 |
| Excretion | Clearance_Hepatocyte_AZ | Spearman ↑ | 0.439 ± 0.026 @ContextPred | 0.495 ± 0.030 |
| Excretion | Clearance_Microsome_AZ | Spearman ↑ | 0.599 ± 0.025 @Chemprop-RDKit | 0.611 ± 0.016 |
| Toxicity | LD50_Zhu | MAE ↓ | 0.606 ± 0.024 @Chemprop | 0.541 ± 0.015 |
| Toxicity | hERG | ROC-AUC ↑ | 0.841 ± 0.020 @DeepPurpose | 0.848 ± 0.009 |
| Toxicity | AMES | ROC-AUC ↑ | 0.850 ± 0.004 @Chemprop-RDKit | 0.854 ± 0.007 |
| Toxicity | DILI | ROC-AUC ↑ | 0.919 ± 0.008 @ContextPred | 0.920 ± 0.012 |

(back to top)

Supported Features

The provided code examples demonstrate how to fine-tune ChemFM. The code currently supports only the following fine-tuning techniques and tasks; more options will be tested and added over time.

You are welcome to customize the output head and loss function to adapt to other types of tasks.

Fine-tuning Techniques

  • LoRA (Low-Rank Adaptation)
  • Full-parameter fine-tuning (not yet fully tested)

Supported Tasks

  • Single-label regression and classification (binary labels)
  • Multi-label classification (each instance can be assigned multiple binary labels)
  • Multi-class classification (each instance is assigned exactly one class out of more than two)
  • Hybrid classification and regression labels (see the sketch below)
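
For the multi-label and hybrid cases, the custom dataset simply carries one label column per target. The sketch below shows how a made-up hybrid file with one binary classification column and one regression column might look; check the custom_data_example folder for the actual reference format.

# A minimal sketch of a hybrid custom dataset layout: a "smiles" column plus
# one binary classification label and one regression label per molecule.
# All molecules and values are made up for illustration.
import pandas as pd

hybrid = pd.DataFrame({
    'smiles': ['CCO', 'CCN', 'c1ccccc1O'],
    'is_toxic': [0, 1, 0],          # binary classification label (made-up)
    'logS': [1.10, 0.85, -0.62],    # regression label (made-up)
})
hybrid.to_csv('train.csv', index=False)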

(back to top)