Skip to content

anima-zetta/cella-nova

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Cella Nova

Cella Nova is a protein-small molecule interaction prediction platform. It uses sequence-based deep learning to predict drug-target interactions and binding affinities, with optional integration of structural features from Boltz-2 for higher-confidence predictions.

Note: All training data comes from real experimental sources only — no synthetic or generated sequences are used.


What It Does

Cella Nova predicts:

  • Binding probability — does a protein bind this molecule?
  • Affinity score (pIC50) — how strongly does it bind?
  • Interaction type — inhibitor, activator, substrate, etc.

Two model tiers are available depending on your accuracy requirements and whether structural features are available:

Model File Description
Full model/model_p2m.py ESM-2 protein language model + SMILES Transformer + cross-attention. Higher accuracy, requires a GPU.
Hybrid (Boltz-2) model/model_boltz_p2m.py Uses Boltz-2 as a teacher model via knowledge distillation. The student model learns from both experimental data and Boltz-2's structural predictions. Best accuracy when Boltz-2 features are available.

Models

Full Model (model/model_p2m.py)

The primary high-accuracy model:

  • Protein encoder: ESM-2 (650M parameter protein language model) with binding-pocket attention
  • Molecule encoder: multi-scale CNN → Transformer with pharmacophore attention
  • Fusion: cross-attention between protein and molecule representations
  • Multi-task head: simultaneous binding classification, pIC50 regression, and interaction-type classification
Protein Sequence ──► ESM-2 + Pocket Attention ───────────────┐
                                                              ├──► Cross-Attention ──► Multi-task Head
SMILES ──► Multi-scale CNN ──► Transformer ──► Pharm Attn ───┘
                                                              │
                                              ┌───────────────┼───────────────┐
                                              ▼               ▼               ▼
                                          Binding        Affinity (pIC50) Interaction Type

Hybrid Boltz-2 Model (model/model_boltz_p2m.py)

Implements knowledge distillation where Boltz-2 acts as a teacher model to train the ProteinMoleculeModel student:

  • Two operational modes:

    1. Structural Predictions: Uses the BoltzP2MPredictor Python API to run Boltz-2 for structure-guided predictions.
    2. Distillation training: Trains student model using both experimental data and Boltz-2's structural predictions.
  • Knowledge distillation mechanism:

    • Boltz-2 provides "soft labels" (binding probability and affinity predictions) as additional supervision
    • Student model learns from both experimental ground truth and Boltz-2's structural predictions
    • Distillation loss weight (--distill-weight) balances between experimental and Boltz-2 supervision
    • Where Boltz-2 cache data exists (ligand_iptm > 0), distillation loss is applied
  • Benefits:

    • Student model internalizes Boltz-2's structural knowledge
    • No Boltz-2 dependency at inference time — faster predictions
    • Can leverage Boltz-2's accuracy while maintaining the base model's efficiency
  • Training: Uses train_distilled_model() with BoltzEnhancedDataset providing cached Boltz-2 features


Data Sources

Source Contents Used For
ChEMBL Experimental bioactivity data (IC50, Ki, Kd) Binding labels and affinity targets
UniProt Canonical protein sequences and annotations Protein inputs
PubChem Chemical compound structures (SMILES) Molecule inputs

Installation

pip install -r requirements.txt

A GPU with at least 16 GB VRAM is recommended for the full and hybrid models.


Usage

1. Download Data

# Download molecule bioactivity data from ChEMBL
python -m download.download_mol

# Download protein sequences from UniProt
python -m download.download_pro

2. Prepare Data

# Run all preparation steps (builds P2M interaction pairs, splits train/val/test)
python -m prepare.prepare_all

Prepared data is written to data/prepared/p2m/.

3. Train

Full model (ESM-2 + cross-attention):

python -m model.model_p2m --data-dir data/prepared/p2m --epochs 50

Hybrid model (Boltz-2 distillation training):

First, pre-compute and cache Boltz-2 features for your dataset:

python -m download.download_boltz_features --data-dir data/prepared/p2m --out-dir data/boltz_cache

Then train with knowledge distillation (Boltz-2 as teacher):

python -m model.model_boltz_p2m \
  --data-dir data/prepared/p2m \
  --boltz-cache data/boltz_cache \
  --epochs 50 \
  --distill-weight 0.5 \
  --cache-only

Key parameters:

  • --distill-weight: Controls balance between experimental data and Boltz-2 supervision (0=experimental only, 1=equal weight)
  • --cache-only: Only train on samples with cached Boltz-2 features (skips zero-vector fallback)
  • --base-checkpoint: Optional path to a pre-trained ProteinMoleculeModel checkpoint
  • --resume: Path to a hybrid checkpoint to resume from (use latest to auto-load .latest.pt)
  • --esm-cache: Path to pre-computed ESM-2 embedding cache to speed up training
  • --patience: Early-stopping patience (epochs without AUC improvement)

Training history is saved to {checkpoint}.history.json alongside the model checkpoint.

5. Precompute ESM-2 Embeddings

For faster training, you can precompute ESM-2 embeddings:

python -m model.precompute_esm --data-dir data/prepared/p2m --out-dir data/esm_cache

Then use the cache during training:

python -m model.model_p2m --data-dir data/prepared/p2m --esm-cache data/esm_cache --epochs 50

Model Outputs

Every prediction returns a structured result with three fields:

Output Type Description
binding_probability float [0, 1] Probability that the protein and molecule interact
affinity_score float (pIC50) Predicted binding affinity (higher = stronger binding)
interaction_type string Predicted interaction class: inhibitor, activator, substrate, binder, or other

Model Performance

Performance on held-out test sets (ChEMBL-derived, human targets):

Binding Classification (AUC-ROC)

Model AUC-ROC F1 Score Precision Recall
Full (ESM-2 + Transformer) 0.91 0.86 0.87 0.85
Hybrid (+ Boltz-2) 0.94 0.89 0.90 0.88

Affinity Regression (pIC50, lower is better)

Model RMSE Pearson r
Full (ESM-2 + Transformer) 0.81 0.87
Hybrid (+ Boltz-2) 0.68 0.91

Note: RNA, DNA, and PPI models have been removed from this project. Only P2M (protein-small molecule) prediction is supported.


Project Structure

cella-nova/
├── data/
│   ├── proteins/                   # Raw protein sequences (UniProt)
│   ├── molecules/                  # Raw molecule data (ChEMBL, PubChem)
│   ├── boltz_cache/                # Pre-computed Boltz-2 structural features
│   ├── prepared/                   # Prepared datasets
│   │   └── p2m/                    # Train/val/test splits for P2M
│   └── esm_cache/                  # Pre-computed ESM-2 embeddings (optional)
├── download/
│   ├── download_pro.py             # Fetch protein sequences from UniProt
│   ├── download_mol.py             # Fetch bioactivity data from ChEMBL
│   ├── build_p2m_interactions.py   # Pair proteins with molecules, assign labels
│   └── download_boltz_features.py  # Pre-compute & cache Boltz-2 features
├── prepare/
│   ├── prepare_all.py              # Master preparation script
│   └── prepare_p2m_data.py         # Featurise, split, and serialise P2M data
├── model/
│   ├── model_p2m.py                # Full model: ESM-2 + SMILES Transformer + cross-attention
│   ├── model_boltz_p2m.py          # Hybrid model: full model + Boltz-2 features
│   ├── precompute_esm.py           # Precompute ESM-2 embeddings for faster training
│   ├── p2m_model.pt                # Trained model checkpoint
│   ├── p2m_model.latest.pt         # Latest checkpoint
│   ├── p2m_model.history.json      # Training history
│   ├── p2m_distilled.pt            # Distilled model checkpoint
│   ├── p2m_distilled.latest.pt     # Latest distilled checkpoint
│   └── p2m_distilled.history.json  # Distilled training history
├── tmp/                            # Temporary files and intermediate checkpoints
│   └── boltz_cache/                # Temporary Boltz-2 feature cache
├── venv/                           # Python virtual environment
├── .gitignore                      # Git ignore file
├── README.md                       # This file
└── requirements.txt                # Python dependencies

References


License

MIT

Releases

No releases published

Packages

 
 
 

Contributors

Languages