Chemfuser

Chemfuser is a deep generative model for molecular structure recovery using a diffusion-based architecture. It learns to reconstruct complete SMILES strings from partially masked sequences, enabling chemically valid molecule generation via denoising.

This project is implemented in PyTorch and is designed for flexibility, performance, and chemical validity during training.

Note: This repository is an improved and extended version of Chemfuser on Hugging Face, developed by the same author.

Features

Transformer-based diffusion model for SMILES denoising
Custom vocabulary and forward masking scheduler
Automatic handling of valid SMILES evaluation during sampling
Multi-GPU support with PyTorch DataParallel
Optional AMP + torch.compile() training for performance
Integrated experiment tracking via Weights & Biases (wandb)

Installation

Clone the repository:

git clone https://github.com/Kelu01/Chemfuser.git
cd chem_fuser

Install dependencies:

pip install -r requirements.txt

Training

Standard Training (with optional DataParallel)

python train.py

Optimized Training (AMP + torch.compile)

python train_amp.py

Project Layout and Directory Overview

The following directories and files contain the key components of the Chemfuser project:

model.py – Core model implementation, including the transformer-based diffusion architecture and sampling logic.
vocab.py – SMILES vocabulary management and tokenization functions.
train.py – Main training script supporting both single-GPU and multi-GPU training using DataParallel.
train_amp.py – Optional training script that uses Automatic Mixed Precision (AMP) and torch.compile() for optimized performance on supported hardware.
data/ – Input resources:
- voc.txt – Vocabulary tokens used for SMILES tokenization.
- canonical_smiles.txt – Dataset of SMILES strings (one per line).
checkpoints/ – Directory where model weights are saved during training.
inference/ – Inference pipeline and related files:
- inference.ipynb – Notebook for evaluating the trained model on masked SMILES.
- process.py – Utility script for preparing input and running inference.
- test_set.txt, masked_test_set.txt, inference_data.csv – Inference input and output files.
preprocessing/ – Scripts for preparing the dataset and vocabulary:
- create_dataset.py – Processes raw SMILES into training-ready format.
- create_voc.py – Builds the vocabulary file from the dataset.
LICENSE – MIT license file.
README.md – Project overview, usage guide, and documentation.
requirements.txt – Python dependencies required to run the project.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Chemfuser

Features

Installation

Training

Project Layout and Directory Overview

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
__pycache__		__pycache__
checkpoints		checkpoints
data		data
inference		inference
preprocessing		preprocessing
.DS_Store		.DS_Store
LICENSE		LICENSE
README.md		README.md
model.py		model.py
requirements.txt		requirements.txt
train.py		train.py
train_amp.py		train_amp.py
vocab.py		vocab.py

Folders and files

Latest commit

History

Repository files navigation

Chemfuser

Features

Installation

Training

Project Layout and Directory Overview

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages