🔬 MVHybrid: Hybrid State Space-Vision Transformer for Pathology Foundation Models

MICCAI 2025 Workshop COMPAYL (Oral)

License: CC BY-NC-SA 4.0

Won June Cho¹, Hongjun Yoon¹, Daeky Jeong¹, Hyeongyeol Lim¹, Yosep Chong²

¹AI Research Team 2, AI Research Lab, Deepnoid | ²Department of Hospital Pathology, The Catholic University of Korea


🌟 Highlights

  • 🎯 57% higher correlation than the best ViT in LOSO (leave-one-study-out) evaluation
  • 💪 43% better robustness: a smaller performance drop than the best ViT under distribution shift
  • 🏆 State-of-the-art backbone on biomarker prediction tasks

📖 Abstract

Spatial transcriptomics reveals gene expression patterns within tissue context, enabling precision oncology applications such as treatment response prediction, but its high cost and technical complexity limit clinical adoption. Predicting spatial gene expression (biomarkers) from routine histopathology images offers a practical alternative, yet current vision foundation models (VFMs) in pathology based on Vision Transformer (ViT) backbones perform below clinical standards. Given that VFMs are trained on millions of diverse whole slide images, we hypothesize that architectural innovations beyond ViTs may better capture the low-frequency, subtle morphological patterns correlating with molecular phenotypes. By demonstrating that state space models initialized with negative real eigenvalues exhibit strong low-frequency bias, we introduce MVHybrid, a hybrid backbone architecture combining state space models (SSMs) with ViT. We compare five additional backbone architectures for pathology VFMs, all pretrained on identical colorectal cancer datasets using the DINOv2 self-supervised learning method. We evaluate all pretrained models using both random-split and leave-one-study-out (LOSO) settings of the same biomarker dataset. In LOSO evaluation, MVHybrid achieves 57% higher correlation than the best-performing ViT and shows 43% smaller performance degradation compared to random split in gene expression prediction, demonstrating superior performance and robustness, respectively. Furthermore, MVHybrid shows equal or better downstream performance in classification, patch retrieval, and survival prediction tasks compared to ViT, showing its promise as a next-generation pathology VFM backbone.

🔑 Key Contributions

  1. Hybrid architecture combining MambaVision's SSM layers with ViT for enhanced low-frequency feature capture. This yields superior performance and robustness: 57% higher correlation and 43% smaller performance degradation than a pure ViT backbone under distribution shift.
  2. First systematic comparison of multiple VFM backbone architectures (SSM variants and ViT) pretrained and evaluated on identical datasets

📂 Dataset

Pretraining Data

  • HunCRC: Digital Pathological Slides from Hungarian Colorectal Cancer Screening
  • IMP-CRS2024: IMP Whole-Slide Images of Colorectal Samples 2024

Spatial Transcriptomics Data for Biomarker Prediction

  • HEST-1k: Human Embedded Spatial Transcriptomics dataset
    • HEST-Benchmark: 8 WSI-ST pairs from 4 patients (colorectal samples)
    • HEST-Extended: 54 samples from 8 different study sources (COAD, READ, COADREAD)
    • Data includes 10X Visium, VisiumHD, and Xenium spatial transcriptomics paired with H&E WSIs

Preprocessing

  • WSI patches: Extracted using CLAM's patching function with biopsy preset at 256×256 resolution
  • Gene expression: Normalized using log1p transformation
  • Gene selection:
    • HVG (Highly Variable Genes): Top genes with high expression variance across samples (via scanpy)
    • HMHVG (High Mean Highly Variable Genes): Top genes from HVG that are also highly expressed across samples
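As a rough sketch of this preprocessing (an illustration, not the repository's exact pipeline), the gene selection can be reproduced with scanpy on an AnnData object. The file name and the HMHVG cutoff of 50 below are assumptions; n_top_genes=200 mirrors the HEST-Extended setting reported later.

```python
import numpy as np
import scanpy as sc

# Load spot-level spatial transcriptomics counts (hypothetical file name)
adata = sc.read_h5ad("hest_sample.h5ad")

# log1p-normalize raw expression counts
sc.pp.log1p(adata)

# HVG: top genes by expression variance across spots
sc.pp.highly_variable_genes(adata, n_top_genes=200)
hvg = adata.var_names[adata.var["highly_variable"]]

# HMHVG: among the HVGs, keep the genes with the highest mean expression
means = np.asarray(adata[:, hvg].X.mean(axis=0)).ravel()
hmhvg = hvg[np.argsort(means)[::-1][:50]]  # cutoff of 50 is an assumption
```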

🏗️ Architecture

MVHybrid Architecture

MVHybrid architecture features:

  • First 12 layers: MambaVision (MV) sequence mixing blocks (red) with EinFFT channel mixing blocks (blue)

  • Last 12 layers: Standard Vision Transformer (ViT) blocks with attention (red) and MLP (blue)

  • Key Innovation: Negative real eigenvalues in SSM layers provide enhanced low-frequency bias for capturing subtle biological features. Following Yu et al.'s work, MVHybrid leverages the mathematical property that SSMs with negative real eigenvalues exhibit stronger low-frequency bias compared to complex eigenvalues:

    • Complex eigenvalues: total variation ~ O(1/(ω₀ − ωⱼ))
    • Negative real eigenvalues: total variation ~ O(1/ω₀)

    This faster decay at high frequencies allows MVHybrid to better capture subtle morphological patterns associated with molecular phenotypes.
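As a toy numerical illustration of this property (not the paper's derivation), the magnitude response |H(iω)| = |c/(iω − a)| of a single first-order SSM mode shows that a negative real eigenvalue is monotonically low-pass, while a complex eigenvalue retains a high-frequency resonance near ω = Im(a). The eigenvalues below are arbitrary examples chosen for illustration.

```python
import numpy as np

omega = np.logspace(-2, 2, 500)  # frequency grid

def mode_magnitude(a, c=1.0):
    """|H(iw)| for a single SSM mode with eigenvalue a."""
    return np.abs(c / (1j * omega - a))

real_neg = mode_magnitude(-1.0)           # negative real eigenvalue: decays ~ 1/w
complex_eig = mode_magnitude(-0.05 + 5j)  # complex eigenvalue: peaks near w = 5

# The complex mode keeps a high-frequency resonance at w ≈ Im(a),
# while the negative real mode suppresses all high frequencies.
print(real_neg[omega > 4].max(), complex_eig[omega > 4].max())
```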

📊 Main Results

🧬 Biomarker Prediction Performance

Evaluation Methodology: All models were evaluated by training Ridge regression on extracted patch embeddings to predict spatial gene expression values. The evaluation uses:

  • HEST-Benchmark: Patient-wise 4-fold cross-validation with top 50 HVGs
  • HEST-Extended: Two evaluation settings to assess robustness:
    • Random Split: 10-fold cross-validation mixing all study sources
    • LOSO (Leave-One-Study-Out): 8-fold evaluation where each study source is held out as test set
  • Metrics: Pearson Correlation Coefficient (PCC) for all genes and top-10 genes (PCC-10), Mean Absolute Error (MAE), Mean Squared Error (MSE)
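A minimal sketch of this readout protocol, assuming frozen patch embeddings X and log1p-normalized expression targets Y are available as NumPy arrays; the Ridge alpha and the helper name are assumptions, not the benchmark's exact configuration.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def ridge_eval(X_train, Y_train, X_test, Y_test):
    """Fit Ridge on patch embeddings, report PCC / PCC-10 / MAE / MSE."""
    model = Ridge(alpha=1.0)  # alpha is an assumed default
    model.fit(X_train, Y_train)  # Y: (n_patches, n_genes)
    pred = model.predict(X_test)

    # Per-gene Pearson correlation, then mean over all genes and top-10 genes
    pcc = np.array([pearsonr(Y_test[:, g], pred[:, g])[0]
                    for g in range(Y_test.shape[1])])
    pcc10 = np.sort(pcc)[::-1][:10].mean()

    mae = np.abs(pred - Y_test).mean()
    mse = ((pred - Y_test) ** 2).mean()
    return pcc.mean(), pcc10, mae, mse
```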

HEST-Benchmark Results

| Model | PCC ↑ | PCC-10 ↑ | MAE ↓ | MSE ↓ |
|---|---|---|---|---|
| ViMEinFFT | 0.397±0.065 | 0.685±0.069 | 1.896±0.332 | 5.956±1.985 |
| HydraEinFFT | 0.404±0.064 | 0.692±0.067 | 1.879±0.270 | 5.781±1.674 |
| ViT12 | 0.415±0.055 | 0.720±0.097 | 1.807±0.355 | 5.392±2.064 |
| ViT24 | 0.365±0.042 | 0.664±0.080 | 1.869±0.285 | 5.822±1.834 |
| HydraHybrid | 0.415±0.069 | 0.688±0.082 | 1.824±0.386 | 5.618±2.157 |
| MVHybrid | 0.460±0.082 | 0.747±0.082 | 1.748±0.265 | 5.011±1.478 |

HEST-Extended HVG Results (n=200)

| Model | Eval Type | PCC ↑ | PCC-10 ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|---|
| MVHybrid | Random | 0.214±0.122 | 0.564±0.129 | 0.594±0.283 | 0.488±0.120 |
| | LOSO | 0.138±0.102 | 0.386±0.175 | 0.881±0.671 | 0.614±0.281 |
| | Drop | -35.5% | -31.5% | +48.2% | +25.9% |
| ViT12 (best baseline) | Random | 0.210±0.146 | 0.555±0.143 | 0.593±0.240 | 0.488±0.101 |
| | LOSO | 0.097±0.108 | 0.349±0.174 | 1.003±0.642 | 0.674±0.265 |
| | Drop | -53.7% | -37.2% | +69.2% | +38.1% |

HEST-Extended HMHVG Results (n=200)

| Model | Eval Type | PCC ↑ | PCC-10 ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|---|
| MVHybrid | Random | 0.393±0.162 | 0.620±0.116 | 4.542±1.454 | 1.776±0.337 |
| | LOSO | 0.212±0.166 | 0.454±0.168 | 5.889±4.161 | 1.957±0.853 |
| | Drop | -46.0% | -26.8% | +29.7% | +10.2% |
| ViT12 (best baseline) | Random | 0.373±0.212 | 0.605±0.142 | 4.581±1.344 | 1.780±0.293 |
| | LOSO | 0.110±0.203 | 0.377±0.182 | 7.123±5.225 | 2.174±0.923 |
| | Drop | -70.6% | -37.6% | +55.5% | +22.1% |

Key Finding: MVHybrid demonstrates higher correlation and better robustness to distribution shifts (LOSO evaluation) with the smallest performance degradation across all metrics.

🎯 Other Downstream Tasks

Beyond biomarker prediction, MVHybrid shows equal or better performance compared to ViTs across multiple downstream tasks in colorectal cancer:

  • Classification tasks: MSI/MSS molecular classification on the TCGA-CRC-MSI WSI dataset, and morphology-based subtype/tissue classification on the MHIST and UniToPatho patch datasets
  • Zero-shot patch retrieval: embedding-prototype-based patch retrieval on the NCT-CRC-100K patch dataset (sketched below)
  • Survival prediction: unsupervised Gaussian Mixture Model (GMM)-based survival prediction on the TCGA-CRC WSI dataset
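As a hedged sketch of the embedding-prototype retrieval idea referenced above (an illustration, not the paper's exact protocol): class prototypes are mean embeddings of labeled support patches, and query patches are matched to prototypes by cosine similarity. All names below are hypothetical.

```python
import numpy as np

def prototype_retrieval(support_emb, support_labels, query_emb):
    """Zero-shot retrieval: match each query patch to the class of its
    nearest embedding prototype (per-class mean of support embeddings)."""
    classes = np.unique(support_labels)
    protos = np.stack([support_emb[support_labels == c].mean(axis=0)
                       for c in classes])

    # Cosine similarity between L2-normalized queries and prototypes
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = q @ protos.T
    return classes[sims.argmax(axis=1)]
```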

For detailed results and analysis of these tasks, please refer to our full paper.

🚀 Getting Started

Prerequisites

  • Python 3.11
  • CUDA 12.1 compatible GPU
  • SLURM cluster for multi-node training (preferred)

Installation

```bash
# Clone the repository
git clone https://github.com/chokevin8/MVHybrid.git
cd MVHybrid

# Create conda environment
conda create -n mambavision python==3.11
conda activate mambavision

# Install PyTorch and dependencies
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 xformers --index-url https://download.pytorch.org/whl/cu121
conda install nvidia/label/cuda-12.1.0::cuda-nvcc

# Install Mamba and other requirements
pip install --no-cache-dir tensorboardX causal-conv1d==1.4.0 mamba-ssm==2.2.2 timm==1.0.9 einops transformers
pip install fvcore submitit omegaconf
```

Training

The model is trained using DINOv2 self-supervised learning on a SLURM cluster with multi-node multi-GPU (FSDP) setup:

```bash
sbatch dino/Train_MVHybrid_DINOv2_SLURM.sh
```

📝 Citation

If you find this work useful, please cite our paper:

@inproceedings{cho2025mvhybrid,
  title={$MV_{Hybrid}$: Improving Spatial Transcriptomics Prediction with 
         Hybrid State Space-Vision Transformer Backbone in Pathology 
         Vision Foundation Models},
  author={Won June Cho and Hongjun Yoon and Daeky Jeong and 
          Hyeongyeol Lim and Yosep Chong},
  booktitle={MICCAI Workshop on Computational Pathology with 
             Multimodal Data (COMPAYL)},
  year={2025},
  url={https://openreview.net/forum?id=vd1xqJLW4X}
}

🙏 Acknowledgments

This work builds upon and extends:

  • DINOv2: Original DINOv2 method, modified for MVHybrid training.
  • MambaVision: Original MambaVision architecture, modified from hierarchical to isotropic design for DINOv2 compatibility.

Funding

This research was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: RS-2021-KH113146).

📧 Contact

For questions and collaborations, please reach out to:

