A large-scale, diverse and curated dataset for molecular representation learning and pretraining ML models.
arXiv preprint: "MolPILE - large-scale, diverse dataset for molecular representation learning" J. Adamczyk, J. Poziemski, F. Job, M. Król, M. Makowski
HuggingFace dataset: https://huggingface.co/datasets/scikit-fingerprints/MolPILE
To install, you will need:
- Python 3.11
- uv
- make
- aria2
- ripgrep
- unzip
Then, run make setup.
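Before running make setup, you can verify that the tools listed above are on your PATH with a short stdlib-only script (note that the aria2 package installs the aria2c binary, and ripgrep installs rg):

```python
import shutil
import sys

# Binaries for the tools listed in the installation requirements above.
REQUIRED_TOOLS = ["uv", "make", "aria2c", "rg", "unzip"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    if sys.version_info[:2] != (3, 11):
        print(f"Warning: Python 3.11 expected, found {sys.version.split()[0]}")
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All prerequisites found; you can run: make setup")
```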
From a terminal, run python main_molpile.py.
If you use PyCharm's Run command instead, enable the "Emulate terminal in output console"
option in the run configuration, so that console output is rendered properly.
Note that this requires a lot of RAM (at least ~300 GB) and many CPU cores; the run takes ~24 h on 128 cores.
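Given the ~300 GB requirement, a quick pre-flight check can save a failed run. This sketch reads total RAM via POSIX sysconf values, so it assumes a Linux (or macOS) machine:

```python
import os

def total_ram_gb():
    """Total physical RAM in GiB, via POSIX sysconf (Linux/macOS)."""
    page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per memory page
    num_pages = os.sysconf("SC_PHYS_PAGES")  # total physical pages
    return page_size * num_pages / 1024**3

if __name__ == "__main__":
    ram = total_ram_gb()
    cores = os.cpu_count()
    print(f"RAM: {ram:.0f} GiB, CPU cores: {cores}")
    if ram < 300:
        print("Warning: MolPILE creation needs ~300 GB of RAM")
```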
To train Mol2Vec, first create the MolPILE dataset.
Then create a corpus of ECFP invariant texts:
python mol2vec/create_corpus.py
Train Mol2Vec embeddings:
python mol2vec/train.py
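For intuition: Mol2Vec treats each molecule as a "sentence" whose "words" are Morgan/ECFP substructure identifiers, so the corpus is plain text with one molecule per line. A minimal stdlib-only sketch of that format (the identifier values below are made up for illustration; create_corpus.py computes real ones):

```python
def to_corpus_line(identifiers):
    """Join one molecule's substructure identifiers into a space-separated 'sentence'."""
    return " ".join(str(i) for i in identifiers)

# Hypothetical ECFP invariant identifiers for two molecules.
molecules = [
    [2246728737, 3542456614, 2245384272],
    [864674487, 1510328189],
]
corpus = "\n".join(to_corpus_line(mol) for mol in molecules)
print(corpus)
```

Word2Vec-style training then consumes these lines exactly as it would natural-language sentences.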
Note that training ChemBERTa requires a lot of RAM (at least ~100 GB) and GPU memory, and takes a long time; tokenization alone takes ~8 h.
To train ChemBERTa, first create the MolPILE dataset.
Then train the tokenizer:
python chemberta/train_tokenizer.py
Tokenize the dataset:
python chemberta/tokenize_dataset.py
Train the ChemBERTa MLM model:
python chemberta/train_mlm.py
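For intuition about what the tokenizer operates on, here is a common regex-based SMILES tokenizer (the pattern popularized by Schwaller et al.'s Molecular Transformer work); the tokenizer actually trained by train_tokenizer.py may differ:

```python
import re

# Matches bracket atoms ([NH4+]), two-letter elements (Cl, Br), aromatic atoms,
# bonds, branches, ring-closure digits, and %NN two-digit ring closures.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_PATTERN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, all single-char tokens
print(tokenize_smiles("[NH4+]"))                 # bracket atom kept whole
```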
MoleculeNet and TDC datasets are downloaded automatically with scikit-fingerprints.
ApisTox is included in the repository as small files. The WelQrate datasets need to be
downloaded from the official website and placed into chemberta/welqrate_datasets
(CSV files) and chemberta/welqrate_datasets/scaffold_split_idxs (.pt files).
Do the same for the Mol2Vec benchmarks.
Then run the appropriate benchmark scripts from the project root, e.g. python chemberta/benchmark_apistox.py.
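A small helper can confirm that the WelQrate files landed in the expected locations before you start a benchmark (the directory names are taken from the instructions above):

```python
from pathlib import Path

def check_welqrate_layout(root="chemberta"):
    """Return (csv_files, split_files) found under the expected WelQrate directories."""
    base = Path(root) / "welqrate_datasets"
    csvs = sorted(base.glob("*.csv"))
    splits = sorted((base / "scaffold_split_idxs").glob("*.pt"))
    return csvs, splits

if __name__ == "__main__":
    csvs, splits = check_welqrate_layout()
    print(f"Found {len(csvs)} CSV files and {len(splits)} scaffold split files")
```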