scikit-fingerprints/MolPILE_dataset

MolPILE dataset

A large-scale, diverse and curated dataset for molecular representation learning and pretraining ML models.

arXiv preprint: "MolPILE - large-scale, diverse dataset for molecular representation learning", J. Adamczyk, J. Poziemski, F. Job, M. Król, M. Makowski, https://arxiv.org/abs/2509.18353

HuggingFace dataset: https://huggingface.co/datasets/scikit-fingerprints/MolPILE
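To inspect the published dataset without downloading all of it, it can be streamed with the HuggingFace `datasets` library. The following is a minimal sketch, assuming `datasets` is installed; the helper name `peek_molpile` is hypothetical, and the split and column names are not assumed (the first available split is used and records are returned as-is).

```python
from itertools import islice

REPO_ID = "scikit-fingerprints/MolPILE"  # dataset ID taken from the URL above


def peek_molpile(n_rows=3):
    """Return the first few records without downloading the full dataset."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields records on the fly instead of downloading everything.
    splits = load_dataset(REPO_ID, streaming=True)
    # Split names are not assumed here; take whichever split comes first.
    first_split = next(iter(splits.values()))
    return list(islice(first_split, n_rows))
```

Calling `peek_molpile()` requires network access and returns a list of dictionaries, one per molecule record.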

Initial setup

Install:

  • Python 3.11
  • uv
  • make
  • aria2
  • ripgrep
  • unzip

Then, run make setup.
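Before running make setup, it may help to confirm that the listed tools are actually on PATH. The sketch below is not part of the repository; the function name `find_missing_tools` is hypothetical. Note that aria2 installs its executable as `aria2c` and ripgrep as `rg`.

```python
import shutil

# Tool name -> executable name looked up on PATH.
# aria2 ships the `aria2c` binary and ripgrep ships `rg`.
REQUIRED_TOOLS = {
    "uv": "uv",
    "make": "make",
    "aria2": "aria2c",
    "ripgrep": "rg",
    "unzip": "unzip",
}


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]
```

An empty list from `find_missing_tools()` means every tool was found; otherwise the returned names are the ones still to install.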

Running pipelines

From a terminal, run python main_molpile.py.

If you use the PyCharm Run command instead, enable the "Emulate terminal in output console" option in the run configuration. Without it, console output may not render properly.

Training Mol2Vec

Note that this requires a lot of RAM (at least ~300 GB) and many CPU cores (training takes ~24h on 128 cores).
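Because a run this long is costly to abort, a quick preflight check of the machine can be worthwhile. This is a minimal sketch, not part of the repository: the function names are hypothetical, the defaults simply mirror the figures in the note above, and the RAM query relies on `os.sysconf`, which is only available on POSIX systems (it returns `None` elsewhere).

```python
import os
from typing import List, Optional


def total_ram_gb() -> Optional[float]:
    """Total physical RAM in GB, or None if the platform does not expose it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. Windows, where os.sysconf does not exist
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size / 1e9


def resource_warnings(min_ram_gb: float = 300, min_cores: int = 128) -> List[str]:
    """Return human-readable warnings for resources below the given thresholds."""
    warnings = []
    ram = total_ram_gb()
    if ram is not None and ram < min_ram_gb:
        warnings.append(f"RAM: {ram:.0f} GB found, about {min_ram_gb:.0f} GB needed")
    cores = os.cpu_count() or 1
    if cores < min_cores:
        warnings.append(f"CPU: {cores} cores found, expect a longer run than on {min_cores}")
    return warnings
```

The core count is a reference point for the ~24h estimate rather than a hard requirement, so the CPU warning is informational only.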

First, create the MolPILE dataset (see "Running pipelines" above).

Create the text corpus of ECFP invariants:

python mol2vec/create_corpus.py
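For orientation, the general Mol2Vec idea is to treat each molecule as a "sentence" whose "words" are the ECFP (Morgan) substructure identifiers of its atoms. The sketch below illustrates that idea only and is not taken from create_corpus.py, which may differ. The function name is hypothetical, and the input is assumed to have the shape that RDKit fills into the `bitInfo` argument of its Morgan fingerprint functions: identifier -> tuple of (atom index, radius) pairs.

```python
def sentence_from_bit_info(bit_info, num_atoms, radius=1):
    """Build a Mol2Vec-style sentence of substructure identifiers.

    Words are emitted atom by atom; for each atom, identifiers are
    ordered by increasing radius (0, 1, ..., radius).
    """
    # per_atom[i][r] = identifier of the substructure centered on atom i at radius r
    per_atom = [dict() for _ in range(num_atoms)]
    for identifier, occurrences in bit_info.items():
        for atom_idx, r in occurrences:
            per_atom[atom_idx][r] = identifier

    sentence = []
    for atom_radii in per_atom:
        for r in range(radius + 1):
            if r in atom_radii:  # larger radii can be missing for some atoms
                sentence.append(str(atom_radii[r]))
    return sentence
```

Writing one such sentence per line, with words separated by spaces, gives a plain-text corpus that word-embedding tools can consume.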

Train Mol2Vec embeddings:

python mol2vec/train.py

Training ChemBERTa

Note that this requires a lot of RAM (at least ~100 GB) and GPU memory. It also takes a long time: tokenization alone takes ~8h.

First, create the MolPILE dataset (see "Running pipelines" above).

Train the tokenizer:

python chemberta/train_tokenizer.py

Tokenize dataset:

python chemberta/tokenize_dataset.py

Train the ChemBERTa MLM model:

python chemberta/train_mlm.py
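Since each step depends on the previous one and the steps run for hours, it can be convenient to chain them so that a failure stops the sequence. This is a minimal sketch of such a wrapper, assuming it is run from the project root; the helper names `build_commands` and `run_pipeline` are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Script paths exactly as listed in the steps above.
CHEMBERTA_STEPS = [
    "chemberta/train_tokenizer.py",
    "chemberta/tokenize_dataset.py",
    "chemberta/train_mlm.py",
]


def build_commands(scripts, python=sys.executable):
    """Turn script paths into argument lists for subprocess."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts, dry_run=False):
    """Run scripts in order; stop at the first one that fails."""
    for cmd in build_commands(scripts):
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never start after a failed one.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline(CHEMBERTA_STEPS, dry_run=True)` only prints the commands. The same wrapper works for the Mol2Vec steps by passing `["mol2vec/create_corpus.py", "mol2vec/train.py"]`.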

Evaluating models

MoleculeNet and TDC datasets are downloaded automatically by scikit-fingerprints. ApisTox is included in the repository as small files. WelQrate datasets must be downloaded manually from the official website: put the CSV files into chemberta/welqrate_datasets and the .pt split files into chemberta/welqrate_datasets/scaffold_split_idxs. Do the same for Mol2Vec.
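Because the WelQrate files are placed by hand, a quick layout check can catch a misplaced download before a benchmark fails. The sketch below is not part of the repository and the function name is hypothetical; it only checks that the two directories named above contain at least one file with the expected extension, and makes no assumption about individual file names.

```python
from pathlib import Path


def check_welqrate_layout(root="chemberta/welqrate_datasets"):
    """Return a list of problems with the manually downloaded WelQrate files."""
    root = Path(root)
    split_dir = root / "scaffold_split_idxs"
    problems = []
    # CSV files are expected directly inside the root directory.
    if not root.is_dir() or not any(root.glob("*.csv")):
        problems.append(f"no CSV files found in {root}")
    # Scaffold split indices (.pt files) are expected in the subdirectory.
    if not split_dir.is_dir() or not any(split_dir.glob("*.pt")):
        problems.append(f"no .pt files found in {split_dir}")
    return problems
```

An empty list means both locations look populated; pass a different `root` to check another copy of the datasets.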

Then run the appropriate benchmarks from the project root, e.g. python chemberta/benchmark_apistox.py.

About

MolPILE - large-scale, diverse dataset for molecular representation learning, https://arxiv.org/abs/2509.18353, https://huggingface.co/datasets/scikit-fingerprints/MolPILE
