A lightweight video action recognition framework that combines Motion Vectors (MV) with CLIP for efficient video understanding on the UCF-101 dataset.
MoCLIP-Lite leverages motion vectors from compressed video as a lightweight alternative to traditional RGB-based video analysis, and combines them with CLIP's text-image understanding for both supervised and zero-shot action recognition.
- Motion Vector Processing: Efficient extraction and processing of motion vectors from compressed videos
- Multi-modal Fusion: Late fusion of motion vector features with CLIP text embeddings
- Zero-shot Learning: CLIP-based zero-shot action recognition capabilities
- Lightweight Architecture: Based on EfficientNet for efficient inference
- Comprehensive Evaluation: Support for both supervised and zero-shot evaluation
The framework consists of three main components:
- MV-TSN Model: Motion Vector Temporal Segment Network based on EfficientNet
- CLIP Integration: Pre-trained CLIP model for text-image understanding
- Late Fusion: Combines MV features with CLIP text embeddings for the final classification (a minimal sketch follows below)
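A minimal sketch of how such a late-fusion head could be wired up is shown below. It is illustrative only: the class name, the EfficientNet-B0 variant, the 2-channel MV input, and the learnable fusion weight are assumptions, not the exact design in models/model.py.

```python
# Minimal late-fusion sketch (illustrative; not the actual models/model.py).
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm


class MVTSNLateFusion(nn.Module):
    """Toy late-fusion head: EfficientNet over MV frames + precomputed CLIP text features."""

    def __init__(self, num_classes=101, clip_dim=512, num_segments=8):
        super().__init__()
        self.num_segments = num_segments
        # Motion-vector backbone: 2-channel (dx, dy) input instead of RGB.
        self.backbone = timm.create_model(
            "efficientnet_b0", pretrained=False, in_chans=2, num_classes=0
        )
        feat_dim = self.backbone.num_features
        self.mv_head = nn.Linear(feat_dim, num_classes)  # MV-only logits
        self.mv_proj = nn.Linear(feat_dim, clip_dim)     # project into CLIP text space
        self.alpha = nn.Parameter(torch.tensor(0.5))     # fusion weight (assumed learnable)

    def forward(self, mv, text_features):
        # mv: (B, T, 2, H, W) motion-vector segments
        # text_features: (num_classes, clip_dim) precomputed CLIP text embeddings
        b, t, c, h, w = mv.shape
        feats = self.backbone(mv.view(b * t, c, h, w))   # (B*T, feat_dim)
        feats = feats.view(b, t, -1).mean(dim=1)         # TSN-style segment averaging
        mv_logits = self.mv_head(feats)
        sim = F.normalize(self.mv_proj(feats), dim=-1) @ F.normalize(text_features, dim=-1).t()
        return self.alpha * mv_logits + (1.0 - self.alpha) * 100.0 * sim


if __name__ == "__main__":
    model = MVTSNLateFusion()
    dummy_mv = torch.randn(2, 8, 2, 224, 224)
    dummy_text = torch.randn(101, 512)
    print(model(dummy_mv, dummy_text).shape)  # torch.Size([2, 101])
```

The main Python dependencies (see requirements.txt) are: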
torch>=1.9.0
torchvision>=0.10.0
transformers>=4.20.0
timm>=0.6.0
opencv-python>=4.5.0
numpy>=1.21.0
matplotlib>=3.5.0
tqdm>=4.64.0
Pillow>=8.3.0

To set up the project:

- Clone the repository:

      git clone https://github.com/microa/MoCLIP-Lite.git
      cd MoCLIP-Lite

- Install dependencies:

      pip install -r requirements.txt

- Install the package (optional, for development):

      pip install -e .

After installation, prepare the data:

- Download the UCF-101 dataset and extract it to your data directory
- Extract motion vectors from the videos (a rough extraction sketch is shown after this list)
- Update data paths in the configuration files
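One possible way to extract per-frame motion vectors is the CoViAR decoder, which the data/dataloader_coviar.py loader suggests this project targets. The sketch below is illustrative only: the coviar package, the GOP size of 12, the MPEG-4 re-encoding CoViAR expects, and the .npy output layout are all assumptions to adapt to your own pipeline.

```python
# Illustrative MV extraction with the CoViAR decoder (assumed dependency).
# CoViAR expects MPEG-4 Part 2 encoded videos; re-encode UCF-101 first if needed.
import os
import numpy as np
from coviar import get_num_frames, load  # assumed: CoViAR decoder built from source


def extract_motion_vectors(video_path, out_dir, gop_size=12):
    """Dump one (H, W, 2) motion-vector array per frame as .npy files."""
    os.makedirs(out_dir, exist_ok=True)
    for idx in range(get_num_frames(video_path)):
        gop_index, frame_index = divmod(idx, gop_size)
        # representation_type=1 -> motion vectors, accumulate=True -> relative to the I-frame
        mv = load(video_path, gop_index, frame_index, 1, True)
        if mv is None:
            continue  # skip frames the decoder cannot read
        np.save(os.path.join(out_dir, f"{idx:06d}.npy"), mv.astype(np.int16))


# Example (paths are placeholders):
# extract_motion_vectors("/path/to/ucf101/v_ApplyEyeMakeup_g01_c01.mp4",
#                        "/path/to/motion_vectors/v_ApplyEyeMakeup_g01_c01")
```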
Train the MV-TSN model:

    python training/train_mv_tsn.py --data_root /path/to/ucf101 --mv_root /path/to/motion_vectors

Train the fusion model on top of a trained MV-TSN checkpoint:

    python training/train_fusion.py --mv_model_path mv_tsn_best_model.pth --data_root /path/to/ucf101

Run zero-shot evaluation (a sketch of the CLIP text-feature precomputation appears after the project structure below):

    python evaluation/evaluate_zeroshot.py --data_root /path/to/ucf101 --mv_root /path/to/motion_vectors

Test the fusion model:

    python evaluation/test_fusion_final.py --model_path fusion_best_model.pth --data_root /path/to/ucf101

The repository is organized as follows:

MoCLIP-Lite/
├── README.md                       # Project documentation
├── LICENSE                         # MIT License
├── requirements.txt                # Python dependencies
├── setup.py                        # Package installation script
├── configs/                        # Configuration files
│   ├── __init__.py
│   ├── class_mappings.json         # Class name mappings
│   └── prompt_templates.json       # Text prompt templates
├── data/                           # Data processing modules
│   ├── __init__.py
│   ├── dataloader.py               # Basic data loader
│   ├── dataloader_coviar.py        # TSN data loader
│   └── transforms_video.py         # Video transformations
├── models/                         # Model definitions
│   ├── __init__.py
│   └── model.py                    # MV-TSN model definition
├── training/                       # Training scripts
│   ├── __init__.py
│   ├── train_mv_tsn.py             # MV model training
│   ├── train_fusion.py             # Fusion model training
│   └── train_mv_only.py            # MV-only training
├── evaluation/                     # Evaluation and testing
│   ├── __init__.py
│   ├── evaluate_zeroshot.py        # Zero-shot evaluation
│   ├── test_fusion_final.py        # Fusion model testing
│   ├── test_mv_tsn.py              # MV model testing
│   └── test_mv_tsn_32.py           # MV model testing (32 segments)
├── utils/                          # Utility functions
│   ├── __init__.py
│   ├── mv_quiver.py                # Motion vector visualization
│   ├── mv_to_rgb.py                # MV to RGB conversion
│   ├── generate_text_features.py   # Text feature generation
│   └── precompute_clip_features.py # CLIP feature precomputation
└── scripts/                        # Example scripts
    ├── __init__.py
    └── example_usage.py            # Usage examples
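As a reference for the zero-shot path, text features for the 101 UCF-101 class names can be precomputed once and reused, which is what utils/generate_text_features.py and utils/precompute_clip_features.py appear to handle. The sketch below uses the Hugging Face transformers CLIP implementation (transformers is listed in the requirements); the checkpoint name, the prompt template, the assumption that configs/class_mappings.json maps indices to class names, and the output filename are all illustrative.

```python
# Illustrative precomputation of CLIP text features for UCF-101 class prompts.
# Checkpoint, prompt template, JSON structure, and output path are assumptions.
import json
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

with open("configs/class_mappings.json") as f:
    class_names = list(json.load(f).values())  # assumed: {index: "ApplyEyeMakeup", ...}

prompts = [f"a video of a person doing {name}" for name in class_names]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_features = model.get_text_features(**inputs)  # (num_classes, 512)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
torch.save(text_features, "ucf101_text_features.pt")
```

Zero-shot prediction then reduces to taking the class whose text feature has the highest cosine similarity with a video's embedding.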
The framework includes tools for visualizing motion vectors and generating qualitative analysis reports:
    python utils/mv_quiver.py --input_dir /path/to/motion_vectors --output_dir /path/to/visualizations

The model achieves competitive performance on the UCF-101 dataset with significantly reduced computational requirements compared to RGB-based methods.
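For a quick look at the kind of output utils/mv_quiver.py produces, a single motion-vector field can be drawn as a matplotlib quiver plot. The sketch below is not the script itself; it assumes the (H, W, 2) .npy layout used in the extraction sketch above.

```python
# Minimal single-frame quiver plot of a motion-vector field (illustrative).
import numpy as np
import matplotlib.pyplot as plt

mv = np.load("/path/to/motion_vectors/000010.npy")  # (H, W, 2): dx, dy per pixel
step = 16                                           # subsample arrows for readability
ys, xs = np.mgrid[0:mv.shape[0]:step, 0:mv.shape[1]:step]
u = mv[::step, ::step, 0]
v = mv[::step, ::step, 1]

plt.figure(figsize=(6, 4))
plt.quiver(xs, ys, u, -v, angles="xy", scale_units="xy", scale=1, width=0.002)
plt.gca().invert_yaxis()                            # match image coordinates
plt.title("Motion vector field")
plt.savefig("mv_quiver_example.png", dpi=150)
```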
If you use this code in your research, please cite our paper:
@article{huang2025mocliplite,
  title={MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors},
  author={Huang, Binhua and Wang, Nan and Parakash, Arjun and Dev, Soumyabrata},
  journal={arXiv preprint arXiv:2509.17084},
  year={2025}
}

Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- UCF-101 dataset creators
- CLIP model by OpenAI
- EfficientNet by Google Research
- PyTorch and the open-source community