A large-scale, diverse and curated dataset for molecular representation learning and pretraining ML models.
arXiv preprint: "MolPILE - large-scale, diverse dataset for molecular representation learning" J. Adamczyk, J. Poziemski, F. Job, M. Król, M. Makowski
HuggingFace dataset: https://huggingface.co/datasets/scikit-fingerprints/MolPILE
To install, you will need:
- Python 3.11
- uv
- make
- aria2
- ripgrep
- unzip
Then, run make setup.
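Before running make setup, you can verify that the tools listed above are on your PATH with a short stdlib-only script (note that the aria2 package installs the aria2c binary, and ripgrep installs rg):

```python
import shutil
import sys

# Binaries for the tools listed in the installation requirements above.
REQUIRED_TOOLS = ["uv", "make", "aria2c", "rg", "unzip"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    if sys.version_info[:2] != (3, 11):
        print(f"Warning: Python 3.11 expected, found {sys.version.split()[0]}")
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All prerequisites found; you can run: make setup")
```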
From a terminal, run python main_molpile.py.
If you use PyCharm's Run command instead, enable the "Emulate terminal in output console"
option in the run configuration, so that console output is rendered properly.
Note that this requires a lot of RAM (at least ~300 GB) and many CPU cores; the run takes ~24 h on 128 cores.
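Given the ~300 GB requirement, a quick pre-flight check can save a failed run. This sketch reads total RAM via POSIX sysconf values, so it assumes a Linux (or macOS) machine:

```python
import os

def total_ram_gb():
    """Total physical RAM in GiB, via POSIX sysconf (Linux/macOS)."""
    page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per memory page
    num_pages = os.sysconf("SC_PHYS_PAGES")  # total physical pages
    return page_size * num_pages / 1024**3

if __name__ == "__main__":
    ram = total_ram_gb()
    cores = os.cpu_count()
    print(f"RAM: {ram:.0f} GiB, CPU cores: {cores}")
    if ram < 300:
        print("Warning: MolPILE creation needs ~300 GB of RAM")
```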
To train Mol2Vec, first create the MolPILE dataset.
Then create a corpus of ECFP invariant texts:
python mol2vec/create_corpus.py
Train Mol2Vec embeddings:
python mol2vec/train.py
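For intuition: Mol2Vec treats each molecule as a "sentence" whose "words" are Morgan/ECFP substructure identifiers, so the corpus is plain text with one molecule per line. A minimal stdlib-only sketch of that format (the identifier values below are made up for illustration; create_corpus.py computes real ones):

```python
def to_corpus_line(identifiers):
    """Join one molecule's substructure identifiers into a space-separated 'sentence'."""
    return " ".join(str(i) for i in identifiers)

# Hypothetical ECFP invariant identifiers for two molecules.
molecules = [
    [2246728737, 3542456614, 2245384272],
    [864674487, 1510328189],
]
corpus = "\n".join(to_corpus_line(mol) for mol in molecules)
print(corpus)
```

Word2Vec-style training then consumes these lines exactly as it would natural-language sentences.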
Note that training ChemBERTa requires a lot of RAM (at least ~100 GB) and GPU memory, and takes a long time; tokenization alone takes ~8 h.
To train ChemBERTa, first create the MolPILE dataset.
Then train the tokenizer:
python chemberta/train_tokenizer.py
Tokenize the dataset:
python chemberta/tokenize_dataset.py
Train the ChemBERTa MLM model:
python chemberta/train_mlm.py
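For intuition about what the tokenizer operates on, here is a common regex-based SMILES tokenizer (the pattern popularized by Schwaller et al.'s Molecular Transformer work); the tokenizer actually trained by train_tokenizer.py may differ:

```python
import re

# Matches bracket atoms ([NH4+]), two-letter elements (Cl, Br), aromatic atoms,
# bonds, branches, ring-closure digits, and %NN two-digit ring closures.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_PATTERN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, all single-char tokens
print(tokenize_smiles("[NH4+]"))                 # bracket atom kept whole
```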
MoleculeNet and TDC datasets are downloaded automatically with scikit-fingerprints.
ApisTox is included in the repository as small files. The WelQrate datasets need to be
downloaded from the official website and placed into chemberta/welqrate_datasets
(CSV files) and chemberta/welqrate_datasets/scaffold_split_idxs (.pt files).
Do the same for the Mol2Vec benchmarks.
Then run the appropriate benchmark scripts from the project root, e.g. python chemberta/benchmark_apistox.py.
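A small helper can confirm that the WelQrate files landed in the expected locations before you start a benchmark (the directory names are taken from the instructions above):

```python
from pathlib import Path

def check_welqrate_layout(root="chemberta"):
    """Return (csv_files, split_files) found under the expected WelQrate directories."""
    base = Path(root) / "welqrate_datasets"
    csvs = sorted(base.glob("*.csv"))
    splits = sorted((base / "scaffold_split_idxs").glob("*.pt"))
    return csvs, splits

if __name__ == "__main__":
    csvs, splits = check_welqrate_layout()
    print(f"Found {len(csvs)} CSV files and {len(splits)} scaffold split files")
```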