Official code for "The Devil is in Temporal Token: High Quality Video Reasoning Segmentation" CVPR 2025! 🎉
Comparison with previous VRS approaches. (a) Previous methods utilize a single <SEG> token for keyframe-based segmentation, depending heavily on external models for keyframe detection and mask propagation. This reliance can hinder accurate keyframe localization and prevent end-to-end inference. (b) VRS-HQ introduces frame-level <SEG> and a temporal <TAK> token for dynamic aggregation. The aggregated <TAK> token is then used for both keyframe selection and mask generation within SAM2. This enables single-stage inference with precise keyframe selection and high-quality segmentation. (c) VRS-HQ achieves state-of-the-art performance on various image and video datasets across reasoning and referring segmentation.
(a) VRS-HQ architecture. VRS-HQ incorporates a Multimodal Large Language Model for Temporal Token Encoding (<SEG> and <TAK> tokens), Temporal Dynamic Aggregation, Token-driven Keyframe Selection, and Mask Decoding and Propagation. (b) Temporal Dynamic Aggregation (TDA) merges frame-level <SEG> tokens into a temporal <TAK> token using a weighted fusion based on cosine similarity. (c) Token-driven Keyframe Selection (TKS). During training, the frame whose <SEG> token is closest to the <TAK> token is selected as the keyframe. During inference, keyframe selection is refined using SAM2's occlusion scores together with token similarity scores. (d) Mask Decoding and Propagation (MDP). The <TAK> token provides a sparse embedding for SAM2, which generates a keyframe mask and propagates it to the remaining frames via a memory mechanism.
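To make the token flow concrete, below is a minimal PyTorch sketch of how the cosine-similarity weighted fusion in TDA and the training-time keyframe selection in TKS could look. The mean-similarity softmax weighting, tensor shapes, and function names are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def temporal_dynamic_aggregation(seg_tokens: torch.Tensor) -> torch.Tensor:
    """Fuse frame-level <SEG> tokens (T, C) into a single temporal <TAK> token (C,).

    Sketch: each frame is weighted by how similar its <SEG> token is to the
    other frames (cosine similarity), so outlier frames contribute less.
    """
    normed = F.normalize(seg_tokens, dim=-1)           # (T, C)
    sim = normed @ normed.T                             # (T, T) pairwise cosine similarity
    weights = torch.softmax(sim.mean(dim=-1), dim=0)    # (T,) per-frame fusion weight
    return (weights.unsqueeze(-1) * seg_tokens).sum(dim=0)

def select_keyframe(seg_tokens: torch.Tensor, tak_token: torch.Tensor) -> int:
    """Training-time keyframe selection: pick the frame whose <SEG> token is
    closest (cosine) to the aggregated <TAK> token.
    (At inference, the paper additionally refines this with SAM2 occlusion scores.)
    """
    sims = F.cosine_similarity(seg_tokens, tak_token.unsqueeze(0), dim=-1)  # (T,)
    return int(sims.argmax())

# Toy example: 8 frames, 256-dim token space.
seg = torch.randn(8, 256)
tak = temporal_dynamic_aggregation(seg)
print(select_keyframe(seg, tak))
```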
We provide results on the ReVOS benchmark based on Chat-UniVi-7B, demonstrating the effectiveness of our proposed method.
| Model | MLLM | Referring (J&F) | Reasoning (J&F) | Overall (J&F) |
|---|---|---|---|---|
| VISA | Chat-UniVi-7B | 50.9 | 43.0 | 46.9 |
| VISA | Chat-UniVi-13B | 57.4 | 44.3 | 50.9 |
| VRS-HQ | Chat-UniVi-7B | 62.1 | 56.1 | 59.1 |
Please follow the VISA project to download the corresponding image and video datasets. Our data file structure for training and inference is consistent with theirs.
First, download the pretrained weights of SAM2 (hiera_large) by running:

```bash
cd checkpoints
bash download_ckpts.sh
```
Second, download the pretrained weights of Chat-UniVi.
Third, download the weights of CLIP-336 for keyframe selection during inference.
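If you prefer to fetch the Chat-UniVi and CLIP weights programmatically, the sketch below uses `huggingface_hub`; the repository IDs and local paths are assumptions, so adjust them to the checkpoints you actually use.

```python
from huggingface_hub import snapshot_download

# Assumed repository IDs and target directories; adjust to your own setup.
snapshot_download("Chat-UniVi/Chat-UniVi", local_dir="./checkpoints/Chat-UniVi-7B")
snapshot_download("openai/clip-vit-large-patch14-336", local_dir="./checkpoints/clip-vit-large-patch14-336")
```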
We provide the model weights based on Chat-UniVi-7B on Hugging Face and Baidu Drive, respectively: Huggingface / Baidu Drive
```bash
conda create -n vrshq python=3.10 -y
conda activate vrshq

git clone https://github.com/SitongGong/VRS-HQ
cd VRS-HQ
pip install -e .

# For Chat-UniVi
pip install transformers==4.31.0
```
Updating (the training code will be released later; see the todo list below).
You only need to adjust a few configuration options and file paths in the evaluation code.
```bash
# For ReVOS
CUDA_VISIBLE_DEVICES='0' deepspeed --master_port=24999 evaluation_multiseg.py \
  --val_dataset "revos_valid" \
  --log_base_dir "/18515601223/segment-anything-2/rvos_results" \
  --exp_name "evaluation_revos" \
```
📑 Todo list

- [✅] Release inference code
- [✅] Release the model weights of VRS-HQ
- [ ] Release training code
If you find this project useful in your research, please consider citing:
```bibtex
@article{gong2025devil,
  title={The Devil is in Temporal Token: High Quality Video Reasoning Segmentation},
  author={Gong, Sitong and Zhuge, Yunzhi and Zhang, Lu and Yang, Zongxin and Zhang, Pingping and Lu, Huchuan},
  journal={arXiv preprint arXiv:2501.08549},
  year={2025}
}
```
This work is built upon Chat-UniVi, VISA, and SAM2. We sincerely thank the authors for these excellent contributions.