Official code for "The Devil is in Temporal Token: High Quality Video Reasoning Segmentation" CVPR 2025! 🎉
Comparison with previous VRS approaches. (a) Previous methods utilize a single <SEG> token for keyframe-based segmentation, depending heavily on external models for keyframe detection and mask propagation. This reliance can hinder accurate keyframe localization and prevent end-to-end inference. (b) VRS-HQ introduces frame-level <SEG> and a temporal <TAK> token for dynamic aggregation. The aggregated <TAK> token is then used for both keyframe selection and mask generation within SAM2. This enables single-stage inference with precise keyframe selection and high-quality segmentation. (c) VRS-HQ achieves state-of-the-art performance on various image and video datasets across reasoning and referring segmentation.
(a) VRS-HQ architecture. VRS-HQ incorporates a Multimodal Large Language Model for Temporal Token Encoding (<SEG> and <TAK> tokens), Temporal Dynamic Aggregation, Token-driven Keyframe Selection, and Mask Decoding and Propagation. (b) Temporal Dynamic Aggregation (TDA) merges frame-level <SEG> tokens into a temporal <TAK> token using a weighted fusion based on cosine similarity. (c) Token-driven Keyframe Selection (TKS). During training, the frame whose <SEG> token is closest to the <TAK> token is selected as the keyframe. During inference, keyframe selection is refined using SAM2's occlusion scores together with token similarity scores. (d) Mask Decoding and Propagation (MDP). The <TAK> token provides a sparse embedding for SAM2, which generates a keyframe mask and propagates it to the remaining frames via a memory mechanism.
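To make the token flow concrete, below is a minimal PyTorch sketch of how the cosine-similarity weighted fusion in TDA and the training-time keyframe selection in TKS could look. The mean-similarity softmax weighting, tensor shapes, and function names are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def temporal_dynamic_aggregation(seg_tokens: torch.Tensor) -> torch.Tensor:
    """Fuse frame-level <SEG> tokens (T, C) into a single temporal <TAK> token (C,).

    Sketch: each frame is weighted by how similar its <SEG> token is to the
    other frames (cosine similarity), so outlier frames contribute less.
    """
    normed = F.normalize(seg_tokens, dim=-1)           # (T, C)
    sim = normed @ normed.T                             # (T, T) pairwise cosine similarity
    weights = torch.softmax(sim.mean(dim=-1), dim=0)    # (T,) per-frame fusion weight
    return (weights.unsqueeze(-1) * seg_tokens).sum(dim=0)

def select_keyframe(seg_tokens: torch.Tensor, tak_token: torch.Tensor) -> int:
    """Training-time keyframe selection: pick the frame whose <SEG> token is
    closest (cosine) to the aggregated <TAK> token.
    (At inference, the paper additionally refines this with SAM2 occlusion scores.)
    """
    sims = F.cosine_similarity(seg_tokens, tak_token.unsqueeze(0), dim=-1)  # (T,)
    return int(sims.argmax())

# Toy example: 8 frames, 256-dim token space.
seg = torch.randn(8, 256)
tak = temporal_dynamic_aggregation(seg)
print(select_keyframe(seg, tak))
```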
We provide results on the ReVOS benchmark based on Chat-UniVi-7B, demonstrating the effectiveness of our proposed method.
| Model | MLLM | Referring (J&F) | Reasoning (J&F) | Overall (J&F) |
|---|---|---|---|---|
| VISA | Chat-UniVi-7B | 50.9 | 43.0 | 46.9 |
| VISA | Chat-UniVi-13B | 57.4 | 44.3 | 50.9 |
| VRS-HQ | Chat-UniVi-7B | 62.1 | 56.1 | 59.1 |
Please follow the VISA project to download the corresponding image and video datasets. Our data file structure for training and inference is consistent with theirs.
First, download the pretrained weights of SAM2 (hiera_large) by running:

```bash
cd checkpoints
bash download_ckpts.sh
```
Second, download the pretrained weights of Chat-UniVi.
Third, download the weights of CLIP-336 for keyframe selection during inference.
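If you prefer to fetch the Chat-UniVi and CLIP weights programmatically, the sketch below uses `huggingface_hub`; the repository IDs and local paths are assumptions, so adjust them to the checkpoints you actually use.

```python
from huggingface_hub import snapshot_download

# Assumed repository IDs and target directories; adjust to your own setup.
snapshot_download("Chat-UniVi/Chat-UniVi", local_dir="./checkpoints/Chat-UniVi-7B")
snapshot_download("openai/clip-vit-large-patch14-336", local_dir="./checkpoints/clip-vit-large-patch14-336")
```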
We provide the model weights based on Chat-UniVi-7B on Hugging Face and Baidu Drive, respectively: Huggingface / Baidu Drive
```bash
conda create -n vrshq python=3.10 -y
conda activate vrshq

git clone https://github.com/SitongGong/VRS-HQ
cd VRS-HQ
pip install -e .

# For Chat-UniVi
pip install transformers==4.31.0
```
Updating (the training code will be released later; see the todo list below).
You only need to adjust a few configuration options and file paths in the evaluation code.
```bash
# For ReVOS
CUDA_VISIBLE_DEVICES='0' deepspeed --master_port=24999 evaluation_multiseg.py \
  --val_dataset "revos_valid" \
  --log_base_dir "/18515601223/segment-anything-2/rvos_results" \
  --exp_name "evaluation_revos" \
```
📑 Todo list

- [✅] Release inference code
- [✅] Release the model weights of VRS-HQ
- [ ] Release training code
If you find this project useful in your research, please consider citing:
```bibtex
@article{gong2025devil,
  title={The Devil is in Temporal Token: High Quality Video Reasoning Segmentation},
  author={Gong, Sitong and Zhuge, Yunzhi and Zhang, Lu and Yang, Zongxin and Zhang, Pingping and Lu, Huchuan},
  journal={arXiv preprint arXiv:2501.08549},
  year={2025}
}
```
This work is built upon Chat-UniVi, VISA, and SAM2. We sincerely thank the authors for these excellent contributions.