ECLARE: multi-teacher contrastive learning via ensemble distillation for diagonal integration of single-cell multi-omic data
This repository is dedicated to Ensemble knowledge distillation for Contrastive Learning of ATAC and RNA Embeddings, a.k.a. ECLARE ⚡🍰.
The manuscript is currently available on bioRxiv.
Installation
-
First, clone the repository:
git clone https://github.com/li-lab-mcgill/ECLARE.git
cd ECLARE
-
Create a virtual environment (use Python 3.9.6 for best reproducibility):
python -m venv eclare_env
-
Activate the virtual environment
Windows
eclare_env\Scripts\activate
macOS and Linux
source eclare_env/bin/activate
Git Bash on Windows
source eclare_env/Scripts/activate
-
Install the package:
For standard installation:
pip install .
For editable installation (recommended for development):
pip install -e .
Configuration
Before running the application, you need to set up your configuration file. Follow these steps:
-
Copy the template configuration file:
cp config/config_template.yaml config/config.yaml
-
Edit config.yaml to suit your environment. Update paths and settings as necessary:
active_environment: "local_directories"
local_directories:
  outpath: "/your/custom/output/path"
  datapath: "/your/custom/data/path"
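As a rough illustration of how these keys relate, the sketch below assumes that active_environment names the block whose outpath and datapath should be used. This is an assumption about the layout shown above, not a description of how ECLARE itself reads the file. resolve_paths is a hypothetical helper, and the configuration is passed in as an already-parsed dictionary (for example, the output of a YAML parser).

```python
from pathlib import Path


def resolve_paths(config: dict) -> dict:
    """Return the outpath/datapath of the active environment as Path objects.

    Hypothetical helper: assumes `active_environment` names a top-level
    block that contains `outpath` and `datapath`, as in the template.
    """
    env_name = config["active_environment"]
    if env_name not in config:
        raise KeyError(f"active_environment '{env_name}' has no matching block")
    env = config[env_name]
    missing = [key for key in ("outpath", "datapath") if key not in env]
    if missing:
        raise KeyError(f"block '{env_name}' is missing keys: {missing}")
    return {key: Path(env[key]) for key in ("outpath", "datapath")}


# Example with the template values, written as a parsed dictionary.
example_config = {
    "active_environment": "local_directories",
    "local_directories": {
        "outpath": "/your/custom/output/path",
        "datapath": "/your/custom/data/path",
    },
}
paths = resolve_paths(example_config)
print(paths["outpath"], paths["datapath"])
```

A check like this can catch a misspelled environment name or a missing path before a long training run starts.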
Requirements
- Python ≥ 3.9 (3.9.6 for best reproducibility)
- See setup.py for a complete list of dependencies
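If you want to confirm the interpreter before installing, a minimal check along these lines may help. It only tests the stated minimum of Python 3.9; python_is_supported is a hypothetical helper name and is not part of the ECLARE package.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the requirements


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    print(f"Python {sys.version.split()[0]} is below the stated minimum of 3.9.")
```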
Overview of ECLARE framework
ECLARE (Ensemble knowledge distillation for Contrastive Learning of ATAC and RNA Embeddings) is a framework designed to integrate single-cell multi-omic data, specifically scRNA-seq and scATAC-seq data, through these key components:
-
Multi-Teacher Knowledge Distillation:
- Multiple teacher models are trained on paired datasets (where RNA and ATAC data are available for the same cells)
- These teachers then guide a student model that works with unpaired data
- This approach helps transfer knowledge from well-understood paired samples to situations where only unpaired data is available
-
Contrastive Learning:
- Uses a refined contrastive learning objective to learn representations of both RNA and ATAC data
- Helps align features across different modalities (RNA and ATAC)
- Enables the model to understand relationships between different data types
-
Transport-based Loss:
- Implements a transport-based loss function for precise alignment between RNA and ATAC modalities
- Helps ensure that the learned representations are biologically meaningful
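To make the contrastive component more concrete, here is a minimal sketch of a standard CLIP-style symmetric InfoNCE loss over paired RNA and ATAC embeddings. It is not ECLARE's refined objective, whose details are given in the manuscript and in src/eclare/losses_and_distances_utils.py. The function names, the temperature value, and the toy embeddings are all assumptions chosen for illustration, and the code uses plain Python lists rather than the tensors a real model would use.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(rna_emb, atac_emb, temperature=0.1):
    """Symmetric InfoNCE loss; rna_emb[i] and atac_emb[i] share a cell."""
    rna = [l2_normalize(v) for v in rna_emb]
    atac = [l2_normalize(v) for v in atac_emb]
    # Temperature-scaled cosine similarity of every RNA/ATAC pair.
    logits = [
        [sum(a * b for a, b in zip(r, c)) / temperature for c in atac]
        for r in rna
    ]
    logits_t = [list(col) for col in zip(*logits)]  # ATAC-to-RNA direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy example: three cells with 2-dimensional embeddings.
rna = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = clip_style_loss(rna, rna)                  # matching pairs agree
shuffled = clip_style_loss(rna, rna[1:] + rna[:1])   # pairs are mismatched
print(aligned < shuffled)
```

The loss is low when each cell's RNA embedding is closest to its own ATAC embedding and high when the pairing is scrambled, which is the property that drives the two modalities into a shared space.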
The framework is particularly valuable because it:
- Addresses the common problem of limited paired multi-omic data
- Enables integration of unpaired data through knowledge transfer
- Preserves biological structure in the integrated data
- Facilitates downstream analyses like gene regulatory network inference
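The knowledge transfer from several paired-data teachers to one student can be sketched as follows. This is a generic multi-teacher distillation example, not ECLARE's implementation: the teacher weights, the use of a KL divergence, and all function names are assumptions made for illustration. Each teacher provides an RNA-by-ATAC similarity matrix, the teachers' row-wise distributions are averaged with the given weights, and the student is penalized for diverging from that ensemble.

```python
import math


def softmax(row):
    """Convert a row of similarity scores into a probability distribution."""
    row_max = max(row)
    exps = [math.exp(v - row_max) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_targets(teacher_logits, teacher_weights):
    """Weighted average of the teachers' row-wise softmax distributions.

    teacher_logits: one similarity matrix (list of rows) per teacher.
    teacher_weights: hypothetical non-negative weights, one per teacher.
    """
    weight_sum = sum(teacher_weights)
    weights = [w / weight_sum for w in teacher_weights]
    n_rows, n_cols = len(teacher_logits[0]), len(teacher_logits[0][0])
    targets = [[0.0] * n_cols for _ in range(n_rows)]
    for w, logits in zip(weights, teacher_logits):
        for i, row in enumerate(logits):
            for j, p in enumerate(softmax(row)):
                targets[i][j] += w * p
    return targets


def distillation_loss(student_logits, targets, eps=1e-12):
    """Mean KL divergence KL(targets || student) over rows."""
    total = 0.0
    for s_row, t_row in zip(student_logits, targets):
        s_prob = softmax(s_row)
        total += sum(
            t * math.log((t + eps) / (s + eps))
            for t, s in zip(t_row, s_prob)
            if t > 0
        )
    return total / len(student_logits)


# Toy example: two teachers score the same 2 x 2 similarity matrix.
teacher_a = [[4.0, 0.0], [0.0, 4.0]]
teacher_b = [[2.0, 1.0], [1.0, 2.0]]
targets = ensemble_targets([teacher_a, teacher_b], teacher_weights=[0.75, 0.25])

agreeing_student = [[3.0, 0.0], [0.0, 3.0]]
disagreeing_student = [[0.0, 3.0], [3.0, 0.0]]
loss_agree = distillation_loss(agreeing_student, targets)
loss_disagree = distillation_loss(disagreeing_student, targets)
print(loss_agree < loss_disagree)
```

In this toy setup a student whose similarities match the teachers' consensus receives a much smaller loss than one that inverts it. How the weights are chosen per teacher is left open here.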
Figure 1 from manuscript: Overview of ECLARE
Manuscript figure/code map
Main figures:
-
Figure 1 (overview schematic)
- Core model/losses: src/eclare/models.py, src/eclare/losses_and_distances_utils.py
- Training loop + orchestration: src/eclare/run_utils.py, scripts/eclare_scripts/eclare_run.py
-
Figure 2 (benchmarking metrics; paired + unpaired MDD)
- ECLARE/KD-CLIP/CLIP training: scripts/eclare_scripts/eclare_run.py, scripts/kd_clip_scripts, scripts/clip_scripts/clip_run.py
- Baselines: scripts/benchmark_diagonal, scripts/benchmark_vertical
- Metrics + plots: src/eclare/eval_utils.py, scripts/plot_figures.py
-
Figure 3 (MDD embeddings + enrichment + GRN subnetwork)
- MDD embedding + enrichment: scripts/enrichment_analyses.py
- Enrichment plotting: scripts/enrichment_plots.py
- GREAT validation: scripts/rGREAT_analysis.R
-
Figure 4 (developmental integration)
- ECLARE developmental analysis: scripts/developmental_post_hoc.py
- Ordinal pseudotime + DPT: scripts/ordinal_post_hoc.py
- SCENIC+ eRegulon scoring: scripts/scenicplus_post_hoc.py
-
Figure 5 (longitudinal MDD co-embedding)
- Co-embedding, label transfer, PAGA: scripts/developmental_post_hoc.py
- Ordinal model training: scripts/ordinal_scripts/ordinal_run.py
- DE + EnrichR for longitudinal hits: scripts/pydeseq2_developmental_analysis.py, scripts/enrichment_plots.py
Supplementary figures:
-
Figure S1 (Brain.GMT enrichment, sc-compReg vs pyDESeq2)
scripts/enrichment_analyses.py, scripts/enrichment_plots.py
-
Figure S2 (GREAT enrichment)
scripts/rGREAT_analysis.R, scripts/enrichment_plots.py
-
Figure S3 (H-MAGMA enrichment)
scripts/enrichment_analyses.py, scripts/enrichment_plots.py
-
Figure S4 (module-score DE for Brain.GMT sets)
scripts/enrichment_analyses.py, scripts/enrichment_plots.py
-
Figure S5 (ABHD17B external expression evidence)
- External sources (GTEx/psychSCREEN); no generation script in this repo
-
Figure S6 (ECLARE vs scJoint vs GLUE embeddings with DPT)
- ECLARE: scripts/developmental_post_hoc.py
- scJoint: scripts/benchmark_diagonal/scJoint/scJoint_latents.py
- GLUE: scripts/benchmark_vertical/glue/glue_latents.py
-
Figure S7 (CORAL ordinal embeddings: PFC_V1_Wang -> Cortex_Velmeshev)
scripts/ordinal_post_hoc.py, scripts/ordinal_scripts/ordinal_run.py
-
Figure S8 (Velmeshev density in ECLARE embedding)
scripts/developmental_post_hoc.py
-
Figure S9 (CORAL ordinal embeddings: PFC_Zhu -> MDD)
scripts/ordinal_post_hoc.py, scripts/ordinal_scripts/ordinal_run.py
-
Figure S10 (balancing donor age by modality/condition)
scripts/developmental_post_hoc.py
-
Figure S11 (co-embedding density: PFC_Zhu vs MDD)
scripts/developmental_post_hoc.py
-
Figure S12 (male-specific pseudotime branch analysis)
scripts/developmental_post_hoc.py
-
Figure S13 (pseudotemporal gene clusters)
scripts/cluster_dev_genes_by_km.py
-
Figure S14 (EnrichR for km3_mdd + EGR1 regulon overlap)
scripts/pydeseq2_developmental_analysis.py, scripts/enrichment_plots.py
-
Figure S15 (pychromVAR differential accessibility)
scripts/pydeseq2_developmental_analysis.py
-
Figure S16 (EGR1 eRegulon scores, male donors)
scripts/developmental_post_hoc.py, scripts/scenicplus_post_hoc.py
Demo: analysis on sample paired datasets
We provide a demo notebook sample_analysis.ipynb to analyze the sample paired datasets.
The analysis uses DLPFC_Anderson and DLPFC_Ma as source datasets and PFC_Zhu as the target dataset. See Table 1 in the manuscript for more details about the datasets.
Sample data is available from Zenodo at https://doi.org/10.5281/zenodo.14794845. Instructions for downloading the data are available in the notebook.
