BDC-CLIP: Brownian Distance Covariance for Adapting CLIP to Action Recognition

Fei Long*, Xiaoou Li*, Jiaming Lv*, Haoyuan Yang, Xianjun Cheng, Peihua Li

ICML 2025 (Poster)
In this paper, we propose BDC-CLIP for action recognition, a novel framework that leverages Brownian Distance Covariance (BDC) to enhance video-language alignment. Unlike prior methods that rely on cosine similarity between global tokens, which limits their ability to capture complex dependencies, BDC-CLIP models both linear and nonlinear relationships across all visual and textual tokens. This enables fine-grained alignment in space, time, and language, which is crucial for understanding dynamic human actions. BDC-CLIP achieves state-of-the-art performance in zero-shot, few-shot, base-to-novel, and fully supervised settings, demonstrating its effectiveness and generality.
- New metric for multimodal alignment: Introduces BDC as an alternative to cosine similarity, capturing complex statistical dependencies for more fine-grained alignment.
- Temporal BDC Attention: Proposes a temporal BDC attention mechanism that captures patch-wise importance and temporal dynamics.
- Strong generalization capability: Achieves state-of-the-art results in zero-shot, few-shot, base-to-novel, and fully-supervised video recognition.
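
For intuition, the sketch below shows one way a BDC-style similarity between a set of visual tokens and a set of textual tokens can be computed in PyTorch: build a double-centered pairwise-distance (BDC) matrix for each token set and compare the two matrices with a normalized inner product. It follows the classical Brownian distance covariance recipe and, to allow different token counts, treats embedding channels as the observations (an assumption of this sketch); the exact formulation, normalization, and temporal attention used in BDC-CLIP are implemented in `trainers/` and may differ.

```python
import torch

def bdc_matrix(tokens: torch.Tensor) -> torch.Tensor:
    """Double-centered pairwise-distance (BDC) matrix of a token set.

    tokens: (num_tokens, dim). Channels are treated as observations here
    (a simplifying assumption), so the output is (dim, dim) and does not
    depend on the number of tokens.
    """
    x = tokens.t()                               # (dim, num_tokens)
    d = torch.cdist(x, x, p=2)                   # pairwise Euclidean distances
    row_mean = d.mean(dim=1, keepdim=True)
    col_mean = d.mean(dim=0, keepdim=True)
    return d - row_mean - col_mean + d.mean()    # double centering

def bdc_similarity(visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    """Dependence score between visual and textual token sets: a normalized
    inner product of their BDC matrices (distance-correlation style)."""
    a = bdc_matrix(visual_tokens)
    b = bdc_matrix(text_tokens)
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

# Toy usage: 196 visual patch tokens vs. 16 text tokens, both 512-d.
score = bdc_similarity(torch.randn(196, 512), torch.randn(16, 512))
```

Unlike a cosine similarity between two pooled global embeddings, this kind of score depends on the joint structure of all token pairs, which is what allows it to pick up nonlinear dependencies.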
```
BDC_CLIP/
├── assets/            # SVG figures for README.md
├── clip/              # CLIP model implementation
├── configs/           # Configuration files for different experimental settings
├── datasets/          # Dataset loaders and preprocessing pipeline
├── datasets_splits/   # Dataset splits for various evaluation protocols
├── labels/            # Label lists for each dataset under different settings
├── prompts/           # Prompts generated by LLMs for each dataset
├── scripts/           # Training and evaluation scripts
├── trainers/          # Core training modules implementing our methods
├── utils/             # Utility functions and helper tools
├── main.py            # Entry point for training and evaluation
├── README.md          # Project documentation
└── requirements.txt   # Python dependencies
```

```bash
conda create -n bdc_clip python=3.10
conda activate bdc_clip
pip install -r requirements.txt
```

Our experiments were conducted on two NVIDIA RTX 4090 GPUs.
Supported datasets:
* Kinetics-400
* Kinetics-600
* UCF-101
* HMDB-51
* Something-Something V2

Please follow the instructions provided by TC_CLIP for data preparation.
```bash
# Modify the dataset paths in configs

# Pretraining on Kinetics-400 (zero-shot)
cd scripts/pre_training
bash zero_shot_pretrain.sh

# Zero-shot evaluation
cd scripts/zero_shot
bash zero_shot_hmdb51.sh
```

Zero-shot evaluation results:

| Setting | HMDB-51 | UCF-101 | K600 | Model |
|---|---|---|---|---|
| w/o WSE | 59.4 | 85.9 | 76.5 | OneDrive Baiduyun |
| w/ WSE | 60.7 | 86.9 | 78.9 | |
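
WSE in the table above presumably refers to weight-space ensembling, i.e., interpolating the fine-tuned weights with the original zero-shot CLIP weights (in the spirit of WiSE-FT and FROSTER); that reading is an assumption. Under it, a minimal sketch looks like this, where `weight_space_ensemble` and the coefficient `alpha` are hypothetical and may differ from what this repository does:

```python
import copy
import torch

def weight_space_ensemble(finetuned: torch.nn.Module,
                          zeroshot: torch.nn.Module,
                          alpha: float = 0.5) -> torch.nn.Module:
    """Interpolate parameters: theta = alpha * theta_ft + (1 - alpha) * theta_zs.

    Hypothetical helper for illustration; which weights are ensembled and the
    interpolation coefficient used for the reported numbers may differ.
    """
    merged_model = copy.deepcopy(finetuned)
    ft, zs = finetuned.state_dict(), zeroshot.state_dict()
    merged = {}
    for name, param in ft.items():
        if torch.is_floating_point(param):
            merged[name] = alpha * param + (1.0 - alpha) * zs[name]
        else:
            merged[name] = param              # leave integer buffers untouched
    merged_model.load_state_dict(merged)
    return merged_model
```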
Few-shot results:

| Dataset | 2-shot | 4-shot | 8-shot | 16-shot | Pretrained Model | Finetuned Model |
|---|---|---|---|---|---|---|
| HMDB-51 | 68.0 | 71.5 | 75.9 | 77.3 | OneDrive Baiduyun | |
| UCF-101 | 94.9 | 96.7 | 97.5 | 98.5 | OneDrive Baiduyun | |
| SSv2 | 9.9 | 11.6 | 14.4 | 19.5 | OneDrive Baiduyun | |
Base-to-novel results:

| Dataset | Base | Novel | HM | Pretrained Model | Finetuned Model |
|---|---|---|---|---|---|
| HMDB-51 | 81.0 | 61.3 | 69.8 | OneDrive Baiduyun | |
| UCF-101 | 97.5 | 88.0 | 92.5 | OneDrive Baiduyun | |
| SSv2 | 20.9 | 18.2 | 19.5 | OneDrive Baiduyun | |
If you find this work useful, please consider citing:
```bibtex
@inproceedings{long2025bdcclip,
  title     = {BDC-CLIP: Brownian Distance Covariance for Adapting CLIP to Action Recognition},
  author    = {Fei Long and Xiaoou Li and Jiaming Lv and Haoyuan Yang and Xianjun Cheng and Peihua Li},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2025}
}
```

Our project is partially based on the open-source projects ViFiCLIP, TC_CLIP, and FROSTER. We sincerely acknowledge their contributions.
If you have any questions or suggestions, feel free to contact:
Fei Long: [email protected]
