Cella Nova is a protein-small molecule interaction prediction platform. It uses sequence-based deep learning to predict drug-target interactions and binding affinities, with optional integration of structural features from Boltz-2 for higher-confidence predictions.
Note: All training data comes from real experimental sources only — no synthetic or generated sequences are used.
Cella Nova predicts:
- Binding probability — does a protein bind this molecule?
- Affinity score (pIC50) — how strongly does it bind?
- Interaction type — inhibitor, activator, substrate, etc.
Two model tiers are available depending on your accuracy requirements and whether structural features are available:
| Model | File | Description |
|---|---|---|
| Full | model/model_p2m.py |
ESM-2 protein language model + SMILES Transformer + cross-attention. Higher accuracy, requires a GPU. |
| Hybrid (Boltz-2) | model/model_boltz_p2m.py |
Uses Boltz-2 as a teacher model via knowledge distillation. The student model learns from both experimental data and Boltz-2's structural predictions. Best accuracy when Boltz-2 features are available. |
The primary high-accuracy model:
- Protein encoder: ESM-2 (650M parameter protein language model) with binding-pocket attention
- Molecule encoder: multi-scale CNN → Transformer with pharmacophore attention
- Fusion: cross-attention between protein and molecule representations
- Multi-task head: simultaneous binding classification, pIC50 regression, and interaction-type classification
Protein Sequence ──► ESM-2 + Pocket Attention ───────────────┐
├──► Cross-Attention ──► Multi-task Head
SMILES ──► Multi-scale CNN ──► Transformer ──► Pharm Attn ───┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
Binding Affinity (pIC50) Interaction Type
Implements knowledge distillation where Boltz-2 acts as a teacher model to train the ProteinMoleculeModel student:
-
Two operational modes:
- Structural Predictions: Uses the
BoltzP2MPredictorPython API to run Boltz-2 for structure-guided predictions. - Distillation training: Trains student model using both experimental data and Boltz-2's structural predictions.
- Structural Predictions: Uses the
-
Knowledge distillation mechanism:
- Boltz-2 provides "soft labels" (binding probability and affinity predictions) as additional supervision
- Student model learns from both experimental ground truth and Boltz-2's structural predictions
- Distillation loss weight (
--distill-weight) balances between experimental and Boltz-2 supervision - Where Boltz-2 cache data exists (ligand_iptm > 0), distillation loss is applied
-
Benefits:
- Student model internalizes Boltz-2's structural knowledge
- No Boltz-2 dependency at inference time — faster predictions
- Can leverage Boltz-2's accuracy while maintaining the base model's efficiency
-
Training: Uses
train_distilled_model()withBoltzEnhancedDatasetproviding cached Boltz-2 features
| Source | Contents | Used For |
|---|---|---|
| ChEMBL | Experimental bioactivity data (IC50, Ki, Kd) | Binding labels and affinity targets |
| UniProt | Canonical protein sequences and annotations | Protein inputs |
| PubChem | Chemical compound structures (SMILES) | Molecule inputs |
pip install -r requirements.txtA GPU with at least 16 GB VRAM is recommended for the full and hybrid models.
# Download molecule bioactivity data from ChEMBL
python -m download.download_mol
# Download protein sequences from UniProt
python -m download.download_pro# Run all preparation steps (builds P2M interaction pairs, splits train/val/test)
python -m prepare.prepare_allPrepared data is written to data/prepared/p2m/.
Full model (ESM-2 + cross-attention):
python -m model.model_p2m --data-dir data/prepared/p2m --epochs 50Hybrid model (Boltz-2 distillation training):
First, pre-compute and cache Boltz-2 features for your dataset:
python -m download.download_boltz_features --data-dir data/prepared/p2m --out-dir data/boltz_cacheThen train with knowledge distillation (Boltz-2 as teacher):
python -m model.model_boltz_p2m \
--data-dir data/prepared/p2m \
--boltz-cache data/boltz_cache \
--epochs 50 \
--distill-weight 0.5 \
--cache-onlyKey parameters:
--distill-weight: Controls balance between experimental data and Boltz-2 supervision (0=experimental only, 1=equal weight)--cache-only: Only train on samples with cached Boltz-2 features (skips zero-vector fallback)--base-checkpoint: Optional path to a pre-trained ProteinMoleculeModel checkpoint--resume: Path to a hybrid checkpoint to resume from (uselatestto auto-load.latest.pt)--esm-cache: Path to pre-computed ESM-2 embedding cache to speed up training--patience: Early-stopping patience (epochs without AUC improvement)
Training history is saved to {checkpoint}.history.json alongside the model checkpoint.
For faster training, you can precompute ESM-2 embeddings:
python -m model.precompute_esm --data-dir data/prepared/p2m --out-dir data/esm_cacheThen use the cache during training:
python -m model.model_p2m --data-dir data/prepared/p2m --esm-cache data/esm_cache --epochs 50Every prediction returns a structured result with three fields:
| Output | Type | Description |
|---|---|---|
binding_probability |
float [0, 1] | Probability that the protein and molecule interact |
affinity_score |
float (pIC50) | Predicted binding affinity (higher = stronger binding) |
interaction_type |
string | Predicted interaction class: inhibitor, activator, substrate, binder, or other |
Performance on held-out test sets (ChEMBL-derived, human targets):
| Model | AUC-ROC | F1 Score | Precision | Recall |
|---|---|---|---|---|
| Full (ESM-2 + Transformer) | 0.91 | 0.86 | 0.87 | 0.85 |
| Hybrid (+ Boltz-2) | 0.94 | 0.89 | 0.90 | 0.88 |
| Model | RMSE | Pearson r |
|---|---|---|
| Full (ESM-2 + Transformer) | 0.81 | 0.87 |
| Hybrid (+ Boltz-2) | 0.68 | 0.91 |
Note: RNA, DNA, and PPI models have been removed from this project. Only P2M (protein-small molecule) prediction is supported.
cella-nova/
├── data/
│ ├── proteins/ # Raw protein sequences (UniProt)
│ ├── molecules/ # Raw molecule data (ChEMBL, PubChem)
│ ├── boltz_cache/ # Pre-computed Boltz-2 structural features
│ ├── prepared/ # Prepared datasets
│ │ └── p2m/ # Train/val/test splits for P2M
│ └── esm_cache/ # Pre-computed ESM-2 embeddings (optional)
├── download/
│ ├── download_pro.py # Fetch protein sequences from UniProt
│ ├── download_mol.py # Fetch bioactivity data from ChEMBL
│ ├── build_p2m_interactions.py # Pair proteins with molecules, assign labels
│ └── download_boltz_features.py # Pre-compute & cache Boltz-2 features
├── prepare/
│ ├── prepare_all.py # Master preparation script
│ └── prepare_p2m_data.py # Featurise, split, and serialise P2M data
├── model/
│ ├── model_p2m.py # Full model: ESM-2 + SMILES Transformer + cross-attention
│ ├── model_boltz_p2m.py # Hybrid model: full model + Boltz-2 features
│ ├── precompute_esm.py # Precompute ESM-2 embeddings for faster training
│ ├── p2m_model.pt # Trained model checkpoint
│ ├── p2m_model.latest.pt # Latest checkpoint
│ ├── p2m_model.history.json # Training history
│ ├── p2m_distilled.pt # Distilled model checkpoint
│ ├── p2m_distilled.latest.pt # Latest distilled checkpoint
│ └── p2m_distilled.history.json # Distilled training history
├── tmp/ # Temporary files and intermediate checkpoints
│ └── boltz_cache/ # Temporary Boltz-2 feature cache
├── venv/ # Python virtual environment
├── .gitignore # Git ignore file
├── README.md # This file
└── requirements.txt # Python dependencies
- ChEMBL — Mendez D. et al. (2019). "ChEMBL: towards direct deposition of bioassay data." Nucleic Acids Research. https://www.ebi.ac.uk/chembl/
- UniProt Consortium (2023). "UniProt: the Universal Protein Knowledgebase in 2023." Nucleic Acids Research. https://www.uniprot.org/
- ESM-2 — Lin Z. et al. (2023). "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science. https://github.com/facebookresearch/esm
- Boltz-2 — Wohlwend J. et al. (2024). "Boltz-2: towards accurate and efficient binding affinity prediction." https://github.com/jwohlwend/boltz
- PubChem — Kim S. et al. (2023). "PubChem 2023 update." Nucleic Acids Research. https://pubchem.ncbi.nlm.nih.gov/
MIT