- Overview
- Quick Setup
- CpGCorpus
- Model Zoo
- Tutorials
- Finetuning
- FAQ
- Citation
- Contact
- License
CpGPT is a foundation model for DNA methylation, trained on genome-wide DNA methylation data. It can generate, impute, and embed methylation profiles, and can be finetuned for various downstream tasks.
- Python 3.10+
- Poetry
- AWS CLI (for downloading dependencies)
We recommend using Poetry for installation:
```bash
# Clone the repository
git clone https://github.com/lcamillo/CpGPT.git
cd CpGPT

# Install poetry if not available
pip install poetry

# Install dependencies with Poetry
poetry install
```
Alternatively, the package is available through:
```bash
# Install with pip
pip install CpGPT
```
Our pre-trained models and data are stored in AWS S3. If you do not already have an AWS account set up, follow these steps:
1. Create an AWS Account
- Go to AWS Console and click "Create an AWS Account" in the top right
- Follow the signup process:
- Provide email and account name
- Enter your personal/business information
- Add payment information (a credit card is required, but the downloads follow free tier limits)
- Complete identity verification (you'll receive a phone call or text)
- Select a support plan (Free tier is sufficient)
2. Install the AWS CLI
For Linux/macOS:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
For Windows:
- Download the AWS CLI MSI installer
- Run the downloaded MSI installer and follow the on-screen instructions
Verify installation:
```bash
aws --version
```
3. Create Access Keys
- Log in to the AWS Console
- Click on your account name in the top right, then "Security credentials"
- Scroll down to "Access keys" and click "Create access key"
- Select "Command Line Interface (CLI)" as the use case
- Check the "I understand..." acknowledgment and click "Next"
- IMPORTANT: Download the CSV file or copy both the "Access key ID" and "Secret access key" to a secure location. You will not be able to view the secret access key again.
4. Configure AWS CLI
Run the following command and enter your credentials when prompted:
```bash
aws configure
```
You'll need to input:
- AWS Access Key ID: the access key ID from step 3
- AWS Secret Access Key: the secret access key from step 3
- Default region name: enter `us-east-1` (where our data is hosted)
- Default output format: enter `json`
5. Test Your Configuration
Verify your setup with the following command, which lists the bucket contents without downloading anything:
```bash
aws s3 ls s3://cpgpt-lucascamillo-public/data/cpgcorpus/raw/ --request-payer requester
```
You should see a list of GSE folders if your configuration is correct.
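If you prefer to check access programmatically, the same listing can be done from Python with boto3. This is a minimal sketch, assuming `boto3` is installed and picks up the credentials configured above:

```python
import boto3

# List the top-level GSE folders in the requester-pays bucket.
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="cpgpt-lucascamillo-public",
    Prefix="data/cpgcorpus/raw/",
    Delimiter="/",
    RequestPayer="requester",
)
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])
```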
### Download the Full Corpus
To download the entire CpGCorpus from our S3 bucket, run the following command:
```bash
aws s3 sync s3://cpgpt-lucascamillo-public/data/cpgcorpus/raw ./data/cpgcorpus/raw --request-payer requester
```
### Directory Layout
The CpGCorpus is organized hierarchically by GSE (GEO Series) accession and, within each series, by GPL (platform) ID. Below is an overview of the directory layout and file contents:
```
cpgcorpus/
└── raw/
    └── {GSE_ID}/
        └── {GPL_ID}/
            ├── betas/
            │   ├── QCDPB.arrow       # Processed beta values via the R sesame QCDPB pipeline
            │   └── gse_betas.arrow   # Raw beta values downloaded from GEO
            └── metadata/
                └── metadata.arrow    # Metadata and sample annotations
```
- The "betas" folder contains one of the two files:
- QCDPB.arrow: Processed data from the R sesame QCDPB pipeline.
- gse_betas.arrow: Beta values as originally downloaded from GEO.
- The "metadata" folder stores the metadata.arrow file that holds supplementary experimental details.
### Supported Methylation Platforms
The corpus includes multiple platforms:
- GPL8490 (27k array)
- GPL13534 (450k)
- GPL18809 (450k)
- GPL21145 (EPIC)
- GPL23976 (EPIC)
- GPL29753 (EPIC)
- GPL33022 (EPICv2)
- GPL34394 (MSA)
### Download a specific sample
To download a specific dataset (for example, GSE163839 using platform GPL13534), run:
```bash
aws s3 cp s3://cpgpt-lucascamillo-public/data/cpgcorpus/raw/GSE163839/GPL13534/betas/QCDPB.arrow ./data/GSE163839.arrow --request-payer requester
```
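Once downloaded, the beta values can be inspected in Python. A minimal sketch, assuming the `.arrow` files are stored in the Arrow IPC (Feather) format and that `pyarrow` and `pandas` are installed:

```python
import pyarrow.feather as feather

# Load the downloaded Arrow file into a pandas DataFrame
# (assumes the file is in Arrow IPC / Feather format).
betas = feather.read_feather("./data/GSE163839.arrow")

# Rows and columns correspond to samples and CpG probes (or vice versa,
# depending on how the table was written) -- inspect the shape to confirm.
print(betas.shape)
print(betas.head())
```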
There are several versions of CpGPT, divided mainly into pretrained and finetuned models. The tables below summarize these versions, including the model name to use for download.
### Pre-trained Models
Model | Size | Parameters | Description | Model Name |
---|---|---|---|---|
CpGPT-2M | 30MB | ~2.5M | Lightweight model for quick experimentation and resource-constrained environments | small |
CpGPT-100M | 1.1GB | ~101M | Full-size model for state-of-the-art performance and high accuracy | large |
### Fine-tuned Models
⚠️ Note: Fine-tuned model weights are currently being updated and will be available soon. The table below shows the models that will be provided.
We provide specialized pre-trained models for common tasks:
Model | Parameters | Description | Output | Model Name |
---|---|---|---|---|
CpGPT-2M-Age | ~2.9M | Multi-tissue chronological age predictor | Age in years | age |
CpGPT-2M-AverageAdultWeight | ~2.9M | Multi-tissue, pan-mammalian weight predictor | Log1p of average adult weight in kilograms | average_adultweight |
CpGPT-100M-BoA | ~101M | EPICv2 blood imputation | No phenotype is predicted | boa |
CpGPT-2M-Cancer | ~2.9M | Multi-tissue cancer predictor | Logits of cancer status (use sigmoid to get probabilities) | cancer |
CpGPT-2M-ClockProxies | ~3.1M | Blood proxies of five epigenetic clocks | altumage, dunedinpace (x100), grimage2, hrsinchphenoage, pchorvath2013 | clock_proxies |
CpGPT-2M-EpicMammal | ~2.5M | Blood EPIC-Mammalian array converter | No phenotype is predicted | epicvmammal |
CpGPT-100M-Hannum | ~101M | 450k blood imputation | No phenotype is predicted | hannum |
CpGPT-100M-HumanRRBSAtlas | ~101M | Multi-tissue RRBS imputation | No phenotype is predicted | human_rrbs_atlas |
CpGPT-100M-Mammalian | ~101M | Multi-tissue, pan-mammalian imputation for the mammalian methylation array | No phenotype is predicted | mammalian |
CpGPT-2M-MaxLifespan | ~2.9M | Multi-tissue, pan-mammalian max lifespan predictor | Log1p of max lifespan in years | maximum_lifespan |
CpGPT-2M-Mortality | ~2.9M | Blood mortality predictor. Please use strict_load=False. | Risk score | mortality |
CpGPT-2M-RelativeAge | ~2.9M | Multi-tissue, pan-mammalian relative age predictor | Relative age (0 to 1) | relative_age |
CpGPT-100M-sciMETv3 | ~101M | Brain, single-cell imputation | No phenotype is predicted | scimetv3 |
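Several of the fine-tuned heads above return transformed values (logits, log1p scales, or a 100x-scaled DunedinPACE). A minimal post-processing sketch, assuming the predictions have already been obtained as NumPy arrays; the variable names below are illustrative, not part of the CpGPT API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative arrays of raw model outputs (one value per sample).
cancer_logits = np.array([-1.2, 0.4, 2.1])      # CpGPT-2M-Cancer
log1p_adult_weight = np.array([3.9, 4.3])       # CpGPT-2M-AverageAdultWeight
log1p_max_lifespan = np.array([3.0, 4.6])       # CpGPT-2M-MaxLifespan
dunedinpace_x100 = np.array([95.0, 110.0])      # from CpGPT-2M-ClockProxies

cancer_probability = sigmoid(cancer_logits)        # logits -> probability
adult_weight_kg = np.expm1(log1p_adult_weight)     # invert log1p (kilograms)
max_lifespan_years = np.expm1(log1p_max_lifespan)  # invert log1p (years)
dunedinpace = dunedinpace_x100 / 100.0             # undo the x100 scaling
```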
More tutorials will be added soon!
⚠️ Warning: Fine-tuning CpGPT models requires a GPU. The training process is computationally intensive and will be extremely slow or may fail entirely without GPU acceleration. We recommend at least 8GB of VRAM for the small model and 24GB+ for the large model.
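A quick way to confirm that a suitable GPU is visible before launching a run, assuming PyTorch is installed in your environment:

```python
import torch

# Check that a CUDA-capable GPU is visible and report its memory,
# since fine-tuning is impractically slow on CPU.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name} ({vram_gb:.1f} GB VRAM)")
else:
    print("No CUDA GPU detected -- fine-tuning is not recommended.")
```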
### Getting Started
- Download dependencies if you have not already done so by following the steps in the quick setup tutorial notebook.
- Prepare your data by following the steps in the quick setup tutorial notebook.
### Configuration
- Create a configuration file by modifying the template in `configs/experiment/`.
- Run fine-tuning with the CLI:

```bash
cpgpt-train experiment=template
```

- Get the best checkpoint from the logs folders:
  - Checkpoint weights: `logs/experiment/{experiment_name}/checkpoints/{experiment_name}.ckpt`
  - Model config: `logs/experiment/{experiment_name}/.hydra/config.yaml`
### Model Configuration
CpGPT provides several parameters to customize your model architecture and training process:
Parameter | Description | Examples |
---|---|---|
`model/net` | Model architecture size | `small.yaml`, `large.yaml` |
`model/optimizer` | Optimization algorithm | `adamw.yaml`, `adamwschedulefree.yaml`, `lion.yaml` |
`model/scheduler` | Learning rate scheduler | `cosine_warmup.yaml`, `constant.yaml` |
### Task-Specific Settings
Modify these parameters in your experiment YAML file to customize the model for different tasks:
```yaml
model:
  training:
    # Type of loss function for condition decoder
    condition_decoder_loss: mae  # Options: mae, mse, ce

    # Weighting for the condition loss vs reconstruction
    loss_weights:
      condition_loss: 0.1

  optimizer:
    # Learning rate
    lr: 0.0001

  net:
    # Enable the condition decoder for prediction tasks
    use_condition_decoder: true

    # Number of target variables to predict
    condition_size: 1  # 1 for regression, can be >1 for multi-target
```
### Training Parameters
Control the training process with these settings:
```yaml
trainer:
  # Minimum training steps (for warmup)
  min_steps: 2000

  # Maximum training steps before stopping
  max_steps: 100000

data:
  # Batch size for training
  batch_size: 16  # Reduce for large models or limited GPU memory

  # Data directories
  train_dir: ${paths.data_dir}/mydata/processed/train
  val_dir: ${paths.data_dir}/mydata/processed/val
  test_dir: ${paths.data_dir}/mydata/processed/test
```
### Checkpointing
Configure model saving behavior:
```yaml
callbacks:
  model_checkpoint:
    # Metric to monitor for saving best model
    monitor: "val/condition_loss"  # Options: val/loss, val/condition_loss, etc.

    # Filename pattern for saved checkpoints
    filename: "${tags[0]}"  # Uses the first tag as filename

    # Save mode
    mode: "min"  # min for losses, max for metrics like accuracy
```
### Logging
Configure experiment logging with these options:
```yaml
logger:
  # WandB logging
  wandb:
    project: "cpgpt"
    name: "${tags[0]}"
    tags: ${tags}
    group: "${task_name}"

  # TensorBoard logging
  tensorboard:
    name: "tensorboard"
    save_dir: "logs/tensorboard/"

  # CSV logging
  csv:
    name: "csv"
    save_dir: "logs/csv/"
```
Available loggers include:
- `wandb.yaml`: Weights & Biases for experiment tracking with visualization
- `tensorboard.yaml`: TensorBoard for local visualization
- `csv.yaml`: Simple CSV logging for offline analysis
- `mlflow.yaml`: MLflow for organization-level experiment tracking
**What methylation array platforms are supported?**
CpGPT was pretrained with bulk data from all Illumina arrays available at the time of writing, except the Horvath Mammalian array. Nevertheless, CpGPT should be able to generalize to new arrays and unseen genomic loci. For RRBS and other sequencing-based methylation measurements, finetuning with at least a subset of the data is highly recommended.

**How much data do I need to fine-tune CpGPT?**
CpGPT can be fine-tuned with as few as 50-100 samples for simple tasks. For complex tasks or higher accuracy, we recommend 500+ samples.

**Should I filter the CpG sites prior to finetuning?**
That depends on the task and the reason for finetuning. For a model that does not predict specific phenotypes and is only used to learn whole-genome methylation profiles, it is best not to filter any features. However, if a specific phenotype is to be predicted, fitting a ridge regression and keeping the top N features can reduce the required training time (see the sketch below).
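A minimal sketch of this kind of pre-filtering, assuming a `betas` matrix (samples x CpG sites) and a phenotype vector `y` are already loaded as NumPy arrays; scikit-learn's `Ridge` is used here as one reasonable choice, not as part of the CpGPT pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge

def top_n_cpgs(betas: np.ndarray, y: np.ndarray, n: int = 10_000) -> np.ndarray:
    """Return column indices of the n CpG sites with the largest
    absolute ridge coefficients for predicting the phenotype y."""
    model = Ridge(alpha=1.0)
    model.fit(betas, y)
    return np.argsort(np.abs(model.coef_))[::-1][:n]

# Example usage with random data standing in for real methylation values.
rng = np.random.default_rng(0)
betas = rng.uniform(0, 1, size=(100, 50_000))   # 100 samples, 50k CpG sites
y = rng.normal(size=100)                        # phenotype of interest
keep = top_n_cpgs(betas, y, n=10_000)
filtered = betas[:, keep]
```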
**How many steps should I finetune it for?**

That ultimately depends on how many samples and how many features are shown to the model. As a rough guide, showing CpGPT each sample-feature combination 50 times works well. For instance, if there are 100 samples with 10,000 CpG sites each, then with a batch size of 10, 100,000 steps would be ideal.

**How can I get the very best possible performance?**
One trick that can increase training time substantially but can lead to some minor performance improvements is to change the following parameter in the `template.yaml` file:

```yaml
model:
  training:
    generative_splits: 5
```
The default for that parameter is 2, which effectively means that generative training is not used.
**Can I use CpGPT for commercial purposes?**
The current release is for non-commercial research purposes only. Please contact us for licensing information for commercial use.

If you use CpGPT in your research, please cite our paper:
```bibtex
@article{camillo2024cpgpt,
  title={CpGPT: A Foundation Model for DNA Methylation},
  author={de Lima Camillo, Lucas Paulo and others},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.10.24.619766},
  url={https://www.biorxiv.org/content/10.1101/2024.10.24.619766v1}
}
```
For contact, please email [email protected].
This project is licensed for non-commercial research purposes only. See LICENSE for details.