CpGPT: A Foundation Model for DNA Methylation

📋 Table of Contents

  • 📖 Overview
  • 🚀 Quick Setup
  • 🗄️ CpGCorpus
  • 🐘 Model Zoo
  • 🧪 Tutorials
  • 🔧 Finetuning
  • ❓ FAQ
  • 📚 Citation
  • ☎️ Contact
  • 📜 License

📖 Overview

CpGPT is a foundation model for DNA methylation, trained on genome-wide DNA methylation data. It can generate, impute, and embed methylation profiles, and can be finetuned for various downstream tasks.

🚀 Quick Setup

Prerequisites

  • Python 3.10+
  • Poetry
  • AWS CLI (for downloading dependencies)

Installation Instructions

We recommend using poetry for installation:

# Clone the repository
git clone https://github.com/lcamillo/CpGPT.git
cd CpGPT

# Install poetry if not available
pip install poetry

# Install dependencies with Poetry
poetry install

Alternatively, the package is available on PyPI:

# Install with pip
pip install CpGPT

Setting up the AWS CLI for Dependencies

Our pre-trained models and data are stored in AWS S3. If you do not already have an AWS account set up, follow these steps:

1. Create an AWS Account
  1. Go to AWS Console and click "Create an AWS Account" in the top right
  2. Follow the signup process:
    • Provide email and account name
    • Enter your personal/business information
    • Add payment information (a credit card is required, but the downloads follow free tier limits)
    • Complete identity verification (you'll receive a phone call or text)
    • Select a support plan (Free tier is sufficient)
2. Install the AWS CLI

For Linux/macOS:

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

For Windows:

  • Download the AWS CLI MSI installer
  • Run the downloaded MSI installer and follow the on-screen instructions

Verify installation:

aws --version
3. Create Access Keys
  1. Log in to the AWS Console
  2. Click on your account name in the top right, then "Security credentials"
  3. Scroll down to "Access keys" and click "Create access key"
  4. Select "Command Line Interface (CLI)" as the use case
  5. Check the "I understand..." acknowledgment and click "Next"
  6. IMPORTANT: Download the CSV file or copy both the "Access key ID" and "Secret access key" to a secure location. You will not be able to view the secret access key again.
4. Configure AWS CLI

Run the following command and enter your credentials when prompted:

aws configure

You'll need to input:

  • AWS Access Key ID: The access key ID from step 3
  • AWS Secret Access Key: The secret access key from step 3
  • Default region name: Enter us-east-1 (where our data is hosted)
  • Default output format: Enter json
5. Test Your Configuration

Verify your setup with this command, which lists the bucket contents without downloading anything:

aws s3 ls s3://cpgpt-lucascamillo-public/data/cpgcorpus/raw/ --request-payer requester

You should see a list of GSE folders if your configuration is correct.
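
If you prefer to check access from Python, the same listing can be done with boto3 (not a CpGPT dependency; install it separately). This is a minimal sketch; only the bucket name, prefix, and requester-pays setting come from the command above:

import boto3

# Uses the credentials and region configured with `aws configure`.
s3 = boto3.client("s3")

# List the top-level GSE folders of the requester-pays corpus bucket.
response = s3.list_objects_v2(
    Bucket="cpgpt-lucascamillo-public",
    Prefix="data/cpgcorpus/raw/",
    Delimiter="/",
    RequestPayer="requester",
    MaxKeys=25,
)
for entry in response.get("CommonPrefixes", []):
    print(entry["Prefix"])  # e.g. data/cpgcorpus/raw/GSE163839/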

🗄️ CpGCorpus

Download the Full Corpus

To download the entire CpGCorpus from our S3 bucket, run the following command:

aws s3 sync s3://cpgpt-lucascamillo-public/data/cpgcorpus/raw ./data/cpgcorpus/raw --request-payer requester

Directory Layout

The CpGCorpus is organized hierarchically by GEO series (GSE) and then by platform (GPL). Below is an overview of the directory layout and file contents:

cpgcorpus/
  └── raw/
      └── {GSE_ID}/
          └── {GPL_ID}/
              ├── betas/
              │   ├── QCDPB.arrow      # Processed beta values via the R sesame QCDPB pipeline
              │   └── gse_betas.arrow  # Raw beta values downloaded from GEO
              └── metadata/
                  └── metadata.arrow   # Metadata and sample annotations
  • The "betas" folder contains one of the two files:
    • QCDPB.arrow: Processed data from the R sesame QCDPB pipeline.
    • gse_betas.arrow: Beta values as originally downloaded from GEO.
  • The "metadata" folder stores the metadata.arrow file that holds supplementary experimental details.

Supported Methylation Platforms

The corpus includes multiple platforms:

  • GPL8490 (27k array)
  • GPL13534 (450k)
  • GPL18809 (450k)
  • GPL21145 (EPIC)
  • GPL23976 (EPIC)
  • GPL29753 (EPIC)
  • GPL33022 (EPICv2)
  • GPL34394 (MSA)

Download a Specific Dataset

To download a specific dataset (for example, GSE163839 using platform GPL13534), run:

aws s3 cp s3://cpgpt-lucascamillo-public/data/cpgcorpus/raw/GSE163839/GPL13534/betas/QCDPB.arrow ./data/GSE163839.arrow --request-payer requester
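
Once downloaded, the .arrow files can be inspected directly in Python. A minimal sketch, assuming the files are Arrow/Feather tables (pandas reads them via pyarrow) and using the local path from the command above; the exact row/column layout may vary by dataset:

import pandas as pd  # requires pyarrow

# Load the beta values downloaded above (Arrow/Feather table assumed).
betas = pd.read_feather("./data/GSE163839.arrow")
print(betas.shape)
print(betas.head())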

🐘 Model Zoo

There are several versions of CpGPT, divided into pretrained and fine-tuned models. The tables below summarize these versions, including the model name used for download.

Pre-trained Models

| Model | Size | Parameters | Description | Model Name |
| --- | --- | --- | --- | --- |
| CpGPT-2M | 30MB | ~2.5M | Lightweight model for quick experimentation and resource-constrained environments | small |
| CpGPT-100M | 1.1GB | ~101M | Full-size model for state-of-the-art performance and high accuracy | large |

Fine-tuned Models

⚠️ Note: Fine-tuned model weights are currently being updated and will be available soon. The table below shows the models that will be provided.

We provide specialized fine-tuned models for common tasks:

| Model | Parameters | Description | Output | Model Name |
| --- | --- | --- | --- | --- |
| CpGPT-2M-Age | ~2.9M | Multi-tissue chronological age predictor | Age in years | age |
| CpGPT-2M-AverageAdultWeight | ~2.9M | Multi-tissue, pan-mammalian weight predictor | Log1p of average adult weight in kilograms | average_adultweight |
| CpGPT-100M-BoA | ~101M | EPICv2 blood imputation | No phenotype is predicted | boa |
| CpGPT-2M-Cancer | ~2.9M | Multi-tissue cancer predictor | Logits of cancer status (use sigmoid to get probabilities) | cancer |
| CpGPT-2M-ClockProxies | ~3.1M | Blood proxies of five epigenetic clocks | altumage, dunedinpace (x100), grimage2, hrsinchphenoage, pchorvath2013 | clock_proxies |
| CpGPT-2M-EpicMammal | ~2.5M | Blood EPIC-Mammalian array converter | No phenotype is predicted | epicvmammal |
| CpGPT-100M-Hannum | ~101M | 450k blood imputation | No phenotype is predicted | hannum |
| CpGPT-100M-HumanRRBSAtlas | ~101M | Multi-tissue RRBS imputation | No phenotype is predicted | human_rrbs_atlas |
| CpGPT-100M-Mammalian | ~101M | Multi-tissue, pan-mammalian mammalian array imputation | No phenotype is predicted | mammalian |
| CpGPT-2M-MaxLifespan | ~2.9M | Multi-tissue, pan-mammalian max lifespan predictor | Log1p of max lifespan in years | maximum_lifespan |
| CpGPT-2M-Mortality | ~2.9M | Blood mortality predictor (use strict_load=False) | Risk score | mortality |
| CpGPT-2M-RelativeAge | ~2.9M | Multi-tissue, pan-mammalian relative age predictor | Relative age (0 to 1) | relative_age |
| CpGPT-100M-sciMETv3 | ~101M | Brain, single-cell imputation | No phenotype is predicted | scimetv3 |
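
The outputs listed above are on different scales, so downstream code usually needs a small post-processing step. Below is a minimal sketch of the transforms implied by the Output column; the raw prediction values are made up for illustration:

import numpy as np

cancer_logit = 1.3                                  # hypothetical CpGPT-2M-Cancer output
cancer_prob = 1.0 / (1.0 + np.exp(-cancer_logit))   # sigmoid: logit -> probability of cancer

log1p_weight = 4.1                                  # hypothetical CpGPT-2M-AverageAdultWeight output
weight_kg = np.expm1(log1p_weight)                  # invert log1p -> kilograms

dunedinpace_x100 = 98.0                             # hypothetical clock-proxy output (reported x100)
dunedinpace = dunedinpace_x100 / 100.0              # rescale to the usual DunedinPACE range

print(cancer_prob, weight_kg, dunedinpace)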

🧪 Tutorials

More tutorials will be added soon!

🔬 Quick setup

Basic introduction to CpGPT and its capabilities

View Tutorial

🗺️ Reference map

Zero-shot label transfer to a reference dataset

View Tutorial

🔧 Finetuning

⚠️ Warning: Fine-tuning CpGPT models requires a GPU. The training process is computationally intensive and will be extremely slow or may fail entirely without GPU acceleration. We recommend at least 8GB of VRAM for the small model and 24GB+ for the large model.

Getting Started
  1. Download dependencies if you have not already done so by following the steps in the quick setup tutorial notebook.

  2. Prepare your data by following the steps in the quick setup tutorial notebook.

Configuration
  1. Create a configuration file by modifying the template in configs/experiment/.

  2. Run fine-tuning with the CLI:

cpgpt-train experiment=template

  3. Get the best checkpoint from the logs folder (see the sketch below this list for inspecting it):
  • Checkpoint weights: logs/experiment/{experiment_name}/checkpoints/{experiment_name}.ckpt
  • Model config: logs/experiment/{experiment_name}/.hydra/config.yaml
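
A minimal sketch for inspecting a saved checkpoint from Python; the experiment name is a placeholder, and the keys listed are the usual contents of a Lightning checkpoint (a regular PyTorch pickle):

import torch

# Path follows the pattern above; "my_experiment" is a placeholder experiment name.
ckpt = torch.load(
    "logs/experiment/my_experiment/checkpoints/my_experiment.ckpt",
    map_location="cpu",
    weights_only=False,  # Lightning checkpoints store more than raw tensors
)
print(list(ckpt.keys()))        # typically includes "state_dict", "hyper_parameters", ...
print(len(ckpt["state_dict"]))  # number of parameter tensors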

Configuration Guide

πŸ” Model Configuration

CpGPT provides several parameters to customize your model architecture and training process:

| Parameter | Description | Examples |
| --- | --- | --- |
| model/net | Model architecture size | small.yaml, large.yaml |
| model/optimizer | Optimization algorithm | adamw.yaml, adamwschedulefree.yaml, lion.yaml |
| model/scheduler | Learning rate scheduler | cosine_warmup.yaml, constant.yaml |

📊 Task-Specific Settings

Modify these parameters in your experiment YAML file to customize the model for different tasks:

model:
  training:
    # Type of loss function for condition decoder
    condition_decoder_loss: mae  # Options: mae, mse, ce

    # Weighting for the condition loss vs reconstruction
    loss_weights:
      condition_loss: 0.1

  optimizer:
    # Learning rate
    lr: 0.0001

  net:
    # Enable the condition decoder for prediction tasks
    use_condition_decoder: true

    # Number of target variables to predict
    condition_size: 1  # 1 for regression, can be >1 for multi-target

⚙️ Training Parameters

Control the training process with these settings:

trainer:
  # Minimum training steps (for warmup)
  min_steps: 2000

  # Maximum training steps before stopping
  max_steps: 100000

data:
  # Batch size for training
  batch_size: 16  # Reduce for large models or limited GPU memory

  # Data directories
  train_dir: ${paths.data_dir}/mydata/processed/train
  val_dir: ${paths.data_dir}/mydata/processed/val
  test_dir: ${paths.data_dir}/mydata/processed/test

💾 Checkpointing

Configure model saving behavior:

callbacks:
  model_checkpoint:
    # Metric to monitor for saving best model
    monitor: "val/condition_loss"  # Options: val/loss, val/condition_loss, etc.

    # Filename pattern for saved checkpoints
    filename: "${tags[0]}"  # Uses the first tag as filename

    # Save mode
    mode: "min"  # min for losses, max for metrics like accuracy
πŸ“ Logging

Configure experiment logging with these options:

logger:
  # WandB logging
  wandb:
    project: "cpgpt"
    name: "${tags[0]}"
    tags: ${tags}
    group: "${task_name}"

  # TensorBoard logging
  tensorboard:
    name: "tensorboard"
    save_dir: "logs/tensorboard/"

  # CSV logging
  csv:
    name: "csv"
    save_dir: "logs/csv/"

Available loggers include:

  • wandb.yaml: Weights & Biases for experiment tracking with visualization
  • tensorboard.yaml: TensorBoard for local visualization
  • csv.yaml: Simple CSV logging for offline analysis
  • mlflow.yaml: MLflow for organization-level experiment tracking

❓ FAQ

What methylation array platforms are supported? CpGPT was pretrained on bulk data from all Illumina arrays available at the time of writing, besides the Horvath Mammalian array. Nevertheless, CpGPT should be able to generalize to new arrays and unseen genomic loci. For RRBS and other sequencing-based methylation measurements, fine-tuning on at least a subset of the data is highly recommended.
How much data do I need to fine-tune CpGPT? CpGPT can be fine-tuned with as few as 50-100 samples for simple tasks. For complex tasks or higher accuracy, we recommend 500+ samples.
Should I filter the CpG sites prior to fine-tuning? That depends on the task and the reason for fine-tuning. If the model does not predict a specific phenotype and is only used to learn whole-genome methylation profiles, it is best not to filter any features. However, if there is a specific phenotype to be predicted, fitting a ridge regression and keeping the top N features can reduce the training time required (see the sketch below).
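
A minimal sketch of this ridge-based pre-filtering, assuming beta values in a samples-by-CpGs matrix and a continuous phenotype; the data, variable names, and choice of N below are illustrative only:

import numpy as np
from sklearn.linear_model import Ridge

# X: samples x CpG sites beta matrix, y: phenotype to predict (random placeholders here).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 10_000))
y = rng.normal(size=100)

# Fit a ridge regression and rank CpG sites by absolute coefficient.
ridge = Ridge(alpha=1.0).fit(X, y)
top_idx = np.argsort(np.abs(ridge.coef_))[-1_000:]  # keep the top N = 1,000 sites

X_filtered = X[:, top_idx]
print(X_filtered.shape)  # (100, 1000)
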
How many steps should I finetune it for? That ultimately depends on how many samples and how many features are shown to the model. As a rough guide, showing CpGPT each sample-feature combination 50 times works well. For instance, if there are 100 samples with 10,000 CpG sites each, then with a batch size of 10, 100,000 steps would be ideal.
How can I get the very best possible performance? One trick that can increase training time substantially but can lead to some minor performance improvements is to change the following parameter in the `template.yaml` file:
model:
  training:
    generative_splits: 5

The default for that parameter is 2, which effectively means that generative training is not used.

Can I use CpGPT for commercial purposes? The current release is for non-commercial research purposes only. Please contact us for licensing information for commercial use.

📚 Citation

If you use CpGPT in your research, please cite our paper:

@article{camillo2024cpgpt,
  title={CpGPT: A Foundation Model for DNA Methylation},
  author={de Lima Camillo, Lucas Paulo et al.},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.10.24.619766},
  url={https://www.biorxiv.org/content/10.1101/2024.10.24.619766v1}
}

☎️ Contact

For contact, please email [email protected].

📜 License

This project is licensed for non-commercial research purposes only. See LICENSE for details.


© 2024 Lucas Paulo de Lima Camillo
