- Overview
- Quick Setup
- CpGCorpus
- Model Zoo
- Tutorials
- Finetuning
- FAQ
- Citation
- Contact
- License
CpGPT is a foundation model for DNA methylation, trained on genome-wide DNA methylation data. It can generate, impute, and embed methylation profiles, and can be finetuned for various downstream tasks.
- Python 3.10+
- Poetry
- AWS CLI (for downloading dependencies)
We recommend using Poetry for installation:
```bash
# Clone the repository
git clone https://github.com/lcamillo/CpGPT.git
cd CpGPT

# Install poetry if not available
pip install poetry

# Install dependencies with Poetry
poetry install
```
Alternatively, the package is available through:
```bash
# Install with pip
pip install CpGPT
```
Our pre-trained models and data are stored in AWS S3. If you do not already have an AWS account set up, follow these steps:
1. Create an AWS Account
- Go to AWS Console and click "Create an AWS Account" in the top right
- Follow the signup process:
- Provide email and account name
- Enter your personal/business information
- Add payment information (a credit card is required, but the downloads follow free tier limits)
- Complete identity verification (you'll receive a phone call or text)
- Select a support plan (Free tier is sufficient)
2. Install the AWS CLI
For Linux/macOS:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
For Windows:
- Download the AWS CLI MSI installer
- Run the downloaded MSI installer and follow the on-screen instructions
Verify installation:
```bash
aws --version
```
3. Create Access Keys
- Log in to the AWS Console
- Click on your account name in the top right, then "Security credentials"
- Scroll down to "Access keys" and click "Create access key"
- Select "Command Line Interface (CLI)" as the use case
- Check the "I understand..." acknowledgment and click "Next"
- IMPORTANT: Download the CSV file or copy both the "Access key ID" and "Secret access key" to a secure location. You will not be able to view the secret access key again.
4. Configure AWS CLI
Run the following command and enter your credentials when prompted:
```bash
aws configure
```
You'll need to input:
- AWS Access Key ID: the access key ID from step 3
- AWS Secret Access Key: the secret access key from step 3
- Default region name: enter `us-east-1` (where our data is hosted)
- Default output format: enter `json`
5. Test Your Configuration
Verify your setup with the following command, which lists the bucket contents without downloading anything:
```bash
aws s3 ls s3://cpgpt-lucascamillo-public/data/cpgcorpus/raw/ --request-payer requester
```
You should see a list of GSE folders if your configuration is correct.
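If you prefer to check access programmatically, the same listing can be done from Python with boto3. This is a minimal sketch, assuming `boto3` is installed and picks up the credentials configured above:

```python
import boto3

# List the top-level GSE folders in the requester-pays bucket.
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="cpgpt-lucascamillo-public",
    Prefix="data/cpgcorpus/raw/",
    Delimiter="/",
    RequestPayer="requester",
)
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])
```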
### Download the Full Corpus
To download the entire CpGCorpus from our S3 bucket, run the following command:
```bash
aws s3 sync s3://cpgpt-lucascamillo-public/data/cpgcorpus/raw ./data/cpgcorpus/raw --request-payer requester
```
### Directory Layout
The CpGCorpus is organized hierarchically by GSE (GEO Series) accession and, within each series, by GPL (platform) ID. Below is an overview of the directory layout and file contents:
```
cpgcorpus/
└── raw/
    └── {GSE_ID}/
        └── {GPL_ID}/
            ├── betas/
            │   ├── QCDPB.arrow       # Processed beta values via the R sesame QCDPB pipeline
            │   └── gse_betas.arrow   # Raw beta values downloaded from GEO
            └── metadata/
                └── metadata.arrow    # Metadata and sample annotations
```
- The "betas" folder contains one of the two files:
- QCDPB.arrow: Processed data from the R sesame QCDPB pipeline.
- gse_betas.arrow: Beta values as originally downloaded from GEO.
- The "metadata" folder stores the metadata.arrow file that holds supplementary experimental details.
### Supported Methylation Platforms
The corpus includes multiple platforms:
- GPL8490 (27k array)
- GPL13534 (450k)
- GPL18809 (450k)
- GPL21145 (EPIC)
- GPL23976 (EPIC)
- GPL29753 (EPIC)
- GPL33022 (EPICv2)
- GPL34394 (MSA)
### Download a specific sample
To download a specific dataset (for example, GSE163839 using platform GPL13534), run:
```bash
aws s3 cp s3://cpgpt-lucascamillo-public/data/cpgcorpus/raw/GSE163839/GPL13534/betas/QCDPB.arrow ./data/GSE163839.arrow --request-payer requester
```
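Once downloaded, the beta values can be inspected in Python. A minimal sketch, assuming the `.arrow` files are stored in the Arrow IPC (Feather) format and that `pyarrow` and `pandas` are installed:

```python
import pyarrow.feather as feather

# Load the downloaded Arrow file into a pandas DataFrame
# (assumes the file is in Arrow IPC / Feather format).
betas = feather.read_feather("./data/GSE163839.arrow")

# Rows and columns correspond to samples and CpG probes (or vice versa,
# depending on how the table was written) -- inspect the shape to confirm.
print(betas.shape)
print(betas.head())
```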
There are several versions of CpGPT, divided mainly into pretrained and finetuned models. The tables below summarize these versions, including the model name to use for download.
### Pre-trained Models
Model | Size | Parameters | Description | Model Name |
---|---|---|---|---|
CpGPT-2M | 30MB | ~2.5M | Lightweight model for quick experimentation and resource-constrained environments | small |
CpGPT-100M | 1.1GB | ~101M | Full-size model for state-of-the-art performance and high accuracy | large |
### Fine-tuned Models
⚠️ Note: Fine-tuned model weights are currently being updated and will be available soon. The table below shows the models that will be provided.
We provide specialized pre-trained models for common tasks:
Model | Parameters | Description | Output | Model Name |
---|---|---|---|---|
CpGPT-2M-Age | ~2.9M | Multi-tissue chronological age predictor | Age in years | age |
CpGPT-2M-AverageAdultWeight | ~2.9M | Multi-tissue, pan-mammalian weight predictor | Log1p of average adult weight in kilograms | average_adultweight |
CpGPT-100M-BoA | ~101M | EPICv2 blood imputation | No phenotype is predicted | boa |
CpGPT-2M-Cancer | ~2.9M | Multi-tissue cancer predictor | Logits of cancer status (use sigmoid to get probabilities) | cancer |
CpGPT-2M-ClockProxies | ~3.1M | Blood proxies of five epigenetic clocks | altumage, dunedinpace (x100), grimage2, hrsinchphenoage, pchorvath2013 | clock_proxies |
CpGPT-2M-EpicMammal | ~2.5M | Blood EPIC-Mammalian array converter | No phenotype is predicted | epicvmammal |
CpGPT-100M-Hannum | ~101M | 450k blood imputation | No phenotype is predicted | hannum |
CpGPT-100M-HumanRRBSAtlas | ~101M | Multi-tissue RRBS imputation | No phenotype is predicted | human_rrbs_atlas |
CpGPT-100M-Mammalian | ~101M | Multi-tissue, pan-mammalian imputation for the mammalian methylation array | No phenotype is predicted | mammalian |
CpGPT-2M-MaxLifespan | ~2.9M | Multi-tissue, pan-mammalian max lifespan predictor | Log1p of max lifespan in years | maximum_lifespan |
CpGPT-2M-Mortality | ~2.9M | Blood mortality predictor. Please use strict_load=False. | Risk score | mortality |
CpGPT-2M-RelativeAge | ~2.9M | Multi-tissue, pan-mammalian relative age predictor | Relative age (0 to 1) | relative_age |
CpGPT-100M-sciMETv3 | ~101M | Brain, single-cell imputation | No phenotype is predicted | scimetv3 |
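Several of the fine-tuned heads above return transformed values (logits, log1p scales, or a 100x-scaled DunedinPACE). A minimal post-processing sketch, assuming the predictions have already been obtained as NumPy arrays; the variable names below are illustrative, not part of the CpGPT API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative arrays of raw model outputs (one value per sample).
cancer_logits = np.array([-1.2, 0.4, 2.1])      # CpGPT-2M-Cancer
log1p_adult_weight = np.array([3.9, 4.3])       # CpGPT-2M-AverageAdultWeight
log1p_max_lifespan = np.array([3.0, 4.6])       # CpGPT-2M-MaxLifespan
dunedinpace_x100 = np.array([95.0, 110.0])      # from CpGPT-2M-ClockProxies

cancer_probability = sigmoid(cancer_logits)        # logits -> probability
adult_weight_kg = np.expm1(log1p_adult_weight)     # invert log1p (kilograms)
max_lifespan_years = np.expm1(log1p_max_lifespan)  # invert log1p (years)
dunedinpace = dunedinpace_x100 / 100.0             # undo the x100 scaling
```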
More tutorials will be added soon!
⚠️ Warning: Fine-tuning CpGPT models requires a GPU. The training process is computationally intensive and will be extremely slow or may fail entirely without GPU acceleration. We recommend at least 8GB of VRAM for the small model and 24GB+ for the large model.
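A quick way to confirm that a suitable GPU is visible before launching a run, assuming PyTorch is installed in your environment:

```python
import torch

# Check that a CUDA-capable GPU is visible and report its memory,
# since fine-tuning is impractically slow on CPU.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name} ({vram_gb:.1f} GB VRAM)")
else:
    print("No CUDA GPU detected -- fine-tuning is not recommended.")
```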
### Getting Started
- Download dependencies if you have not already done so by following the steps in the quick setup tutorial notebook.
- Prepare your data by following the steps in the quick setup tutorial notebook.
### Configuration
- Create a configuration file by modifying the template in `configs/experiment/`.
- Run fine-tuning with the CLI:

```bash
cpgpt-train experiment=template
```

- Get the best checkpoint from the logs folders:
  - Checkpoint weights: `logs/experiment/{experiment_name}/checkpoints/{experiment_name}.ckpt`
  - Model config: `logs/experiment/{experiment_name}/.hydra/config.yaml`
### Model Configuration
CpGPT provides several parameters to customize your model architecture and training process:
Parameter | Description | Examples |
---|---|---|
`model/net` | Model architecture size | `small.yaml`, `large.yaml` |
`model/optimizer` | Optimization algorithm | `adamw.yaml`, `adamwschedulefree.yaml`, `lion.yaml` |
`model/scheduler` | Learning rate scheduler | `cosine_warmup.yaml`, `constant.yaml` |
### Task-Specific Settings
Modify these parameters in your experiment YAML file to customize the model for different tasks:
```yaml
model:
  training:
    # Type of loss function for condition decoder
    condition_decoder_loss: mae  # Options: mae, mse, ce

    # Weighting for the condition loss vs reconstruction
    loss_weights:
      condition_loss: 0.1

  optimizer:
    # Learning rate
    lr: 0.0001

  net:
    # Enable the condition decoder for prediction tasks
    use_condition_decoder: true

    # Number of target variables to predict
    condition_size: 1  # 1 for regression, can be >1 for multi-target
```
### Training Parameters
Control the training process with these settings:
```yaml
trainer:
  # Minimum training steps (for warmup)
  min_steps: 2000

  # Maximum training steps before stopping
  max_steps: 100000

data:
  # Batch size for training
  batch_size: 16  # Reduce for large models or limited GPU memory

  # Data directories
  train_dir: ${paths.data_dir}/mydata/processed/train
  val_dir: ${paths.data_dir}/mydata/processed/val
  test_dir: ${paths.data_dir}/mydata/processed/test
```
### Checkpointing
Configure model saving behavior:
```yaml
callbacks:
  model_checkpoint:
    # Metric to monitor for saving best model
    monitor: "val/condition_loss"  # Options: val/loss, val/condition_loss, etc.

    # Filename pattern for saved checkpoints
    filename: "${tags[0]}"  # Uses the first tag as filename

    # Save mode
    mode: "min"  # min for losses, max for metrics like accuracy
```
### Logging
Configure experiment logging with these options:
```yaml
logger:
  # WandB logging
  wandb:
    project: "cpgpt"
    name: "${tags[0]}"
    tags: ${tags}
    group: "${task_name}"

  # TensorBoard logging
  tensorboard:
    name: "tensorboard"
    save_dir: "logs/tensorboard/"

  # CSV logging
  csv:
    name: "csv"
    save_dir: "logs/csv/"
```
Available loggers include:
- `wandb.yaml`: Weights & Biases for experiment tracking with visualization
- `tensorboard.yaml`: TensorBoard for local visualization
- `csv.yaml`: Simple CSV logging for offline analysis
- `mlflow.yaml`: MLflow for organization-level experiment tracking
**What methylation array platforms are supported?**
CpGPT was pretrained with bulk data from all Illumina arrays available at the time of writing, except the Horvath Mammalian array. Nevertheless, CpGPT should be able to generalize to new arrays and unseen genomic loci. For RRBS and other sequencing-based methylation measurements, finetuning with at least a subset of the data is highly recommended.

**How much data do I need to fine-tune CpGPT?**
CpGPT can be fine-tuned with as few as 50-100 samples for simple tasks. For complex tasks or higher accuracy, we recommend 500+ samples.

**Should I filter the CpG sites prior to finetuning?**
That depends on the task and the reason for finetuning. For a model that does not predict specific phenotypes and is only used to learn whole-genome methylation profiles, it is best not to filter any features. However, if a specific phenotype is to be predicted, fitting a ridge regression and keeping the top N features can reduce the required training time (see the sketch below).
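A minimal sketch of this kind of pre-filtering, assuming a `betas` matrix (samples x CpG sites) and a phenotype vector `y` are already loaded as NumPy arrays; scikit-learn's `Ridge` is used here as one reasonable choice, not as part of the CpGPT pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge

def top_n_cpgs(betas: np.ndarray, y: np.ndarray, n: int = 10_000) -> np.ndarray:
    """Return column indices of the n CpG sites with the largest
    absolute ridge coefficients for predicting the phenotype y."""
    model = Ridge(alpha=1.0)
    model.fit(betas, y)
    return np.argsort(np.abs(model.coef_))[::-1][:n]

# Example usage with random data standing in for real methylation values.
rng = np.random.default_rng(0)
betas = rng.uniform(0, 1, size=(100, 50_000))   # 100 samples, 50k CpG sites
y = rng.normal(size=100)                        # phenotype of interest
keep = top_n_cpgs(betas, y, n=10_000)
filtered = betas[:, keep]
```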
**How many steps should I finetune it for?**

That ultimately depends on how many samples and how many features are shown to the model. As a rough guide, showing CpGPT each sample-feature combination 50 times works well. For instance, if there are 100 samples with 10,000 CpG sites each, then with a batch size of 10, 100,000 steps would be ideal.

**How can I get the very best possible performance?**
One trick that can increase training time substantially but can lead to some minor performance improvements is to change the following parameter in the `template.yaml` file:

```yaml
model:
  training:
    generative_splits: 5
```
The default for that parameter is 2, which effectively means that generative training is not used.
**Can I use CpGPT for commercial purposes?**
The current release is for non-commercial research purposes only. Please contact us for licensing information for commercial use.

If you use CpGPT in your research, please cite our paper:
```bibtex
@article{camillo2024cpgpt,
  title={CpGPT: A Foundation Model for DNA Methylation},
  author={de Lima Camillo, Lucas Paulo and others},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.10.24.619766},
  url={https://www.biorxiv.org/content/10.1101/2024.10.24.619766v1}
}
```
For contact, please email [email protected].
This project is licensed for non-commercial research purposes only. See LICENSE for details.