SitongGong/VRS-HQ


VRS-HQ

Official code for "The Devil is in Temporal Token: High Quality Video Reasoning Segmentation" CVPR 2025! 🎉

🔥 Our Approach

Motivation

Comparison with previous VRS approaches. (a) Previous methods utilize a single <SEG> token for keyframe-based segmentation, depending heavily on external models for keyframe detection and mask propagation. This reliance can hinder accurate keyframe localization and prevent end-to-end inference. (b) VRS-HQ introduces frame-level <SEG> and a temporal <TAK> token for dynamic aggregation. The aggregated <TAK> token is then used for both keyframe selection and mask generation within SAM2. This enables single-stage inference with precise keyframe selection and high-quality segmentation. (c) VRS-HQ achieves state-of-the-art performance on various image and video datasets across reasoning and referring segmentation.

Overall Pipeline

(a) VRS-HQ architecture. VRS-HQ comprises a Multimodal Large Language Model for Temporal Token Encoding (<SEG> and <TAK> tokens), Temporal Dynamic Aggregation, Token-driven Keyframe Selection, and Mask Decoding and Propagation. (b) Temporal Dynamic Aggregation (TDA) merges frame-level <SEG> tokens into a temporal <TAK> token using a weighted fusion based on cosine similarity. (c) Token-driven Keyframe Selection (TKS). During training, the frame whose <SEG> token is closest to the <TAK> token is selected as the keyframe. During inference, keyframe selection is refined using SAM2's occlusion scores and token similarity scores. (d) Mask Decoding and Propagation (MDP). The <TAK> token provides a sparse embedding for SAM2, which generates a keyframe mask and propagates it to the other frames via a memory mechanism.
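To make the TDA and training-time TKS steps described above more concrete, here is a minimal pure-Python sketch. It is not the repository's implementation. It assumes that the fusion weights are a softmax over the cosine similarity between each frame-level <SEG> embedding and the initial <TAK> embedding, and that "closest" means highest cosine similarity. The function names (`aggregate_tak`, `select_keyframe`) and the toy 2-D embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_tak(seg_tokens, tak_token):
    """Assumed TDA: similarity-weighted fusion of <SEG> tokens into one <TAK> token."""
    sims = [cosine(seg, tak_token) for seg in seg_tokens]
    weights = softmax(sims)  # assumption: softmax normalisation of the similarities
    dim = len(tak_token)
    fused = [sum(w * seg[d] for w, seg in zip(weights, seg_tokens)) for d in range(dim)]
    return fused, weights


def select_keyframe(seg_tokens, tak_agg):
    """Assumed training-time TKS: index of the frame whose <SEG> token is closest to <TAK>."""
    sims = [cosine(seg, tak_agg) for seg in seg_tokens]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy example: three frames with 2-D embeddings (hypothetical values).
seg_tokens = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
tak_token = [1.0, 0.2]

tak_agg, weights = aggregate_tak(seg_tokens, tak_token)
keyframe = select_keyframe(seg_tokens, tak_agg)
print("fusion weights:", weights)
print("aggregated <TAK>:", tak_agg)
print("keyframe index:", keyframe)
```

The sketch leaves out the inference-time refinement with SAM2's occlusion scores, since the text above does not specify how the two scores are combined.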

Performance

We provide the results based on Chat-UniVi-7B, demonstrating the effectiveness of our proposed method.

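| Model  | MLLM            | Referring (J&F) | Reasoning (J&F) | Overall (J&F) |
|--------|-----------------|-----------------|-----------------|---------------|
| VISA   | Chat-UniVi-7B   | 50.9            | 43.0            | 46.9          |
| VISA   | Chat-UniVi-13B  | 57.4            | 44.3            | 50.9          |
| VRS-HQ | Chat-UniVi-7B   | 62.1            | 56.1            | 59.1          |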
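The margins over VISA can be read directly from the numbers above. The short snippet below only does that arithmetic. The dictionary layout and the `gain` helper are illustrative, not part of the repository.

```python
# J&F scores copied from the results above.
results = {
    "VISA (Chat-UniVi-7B)": {"referring": 50.9, "reasoning": 43.0, "overall": 46.9},
    "VISA (Chat-UniVi-13B)": {"referring": 57.4, "reasoning": 44.3, "overall": 50.9},
    "VRS-HQ (Chat-UniVi-7B)": {"referring": 62.1, "reasoning": 56.1, "overall": 59.1},
}


def gain(ours, baseline):
    """Per-split J&F difference, rounded to one decimal place."""
    return {split: round(ours[split] - baseline[split], 1) for split in ours}


vrshq = results["VRS-HQ (Chat-UniVi-7B)"]
gain_vs_7b = gain(vrshq, results["VISA (Chat-UniVi-7B)"])
gain_vs_13b = gain(vrshq, results["VISA (Chat-UniVi-13B)"])

print("VRS-HQ vs VISA-7B :", gain_vs_7b)
print("VRS-HQ vs VISA-13B:", gain_vs_13b)
```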

🛠️ Getting Started

Dataset Preparation

Please follow the VISA project to download the corresponding image and video datasets. Our data file structure for training and inference will remain consistent with theirs.

Pretrained Weights

Training

First, download the pretrained weights of SAM2 (hiera_large) by running:

cd checkpoints
bash download_ckpts.sh

Second, download the pretrained weights of Chat-UniVi.

Third, download the weights of CLIP-336 for keyframe selection during inference.

Inference

We provide model weights based on Chat-UniVi-7B on both Hugging Face and Baidu Drive.

Installation

conda create -n vrshq python=3.10 -y
conda activate vrshq
git clone https://github.com/SitongGong/VRS-HQ
cd VRS-HQ
pip install -e .
# For Chat-UniVi
pip install transformers==4.31.0

Validation

Updating

Running VRSHQ to generate masks

You only need to adjust a few configuration values and file paths in the evaluation code.

# For ReVOS
CUDA_VISIBLE_DEVICES='0' deepspeed --master_port=24999 evaluation_multiseg.py \
    --val_dataset "revos_valid" \
    --log_base_dir "/18515601223/segment-anything-2/rvos_results" \
    --exp_name "evaluation_revos"

Using tools to calculate the metrics

📑 Todo list
  • [ ] Release inference code

  • [ ] Release the model weights of VRS-HQ

  • Release training code

🌟 Cite

If you find this project useful in your research, please consider citing:

@article{gong2025devil,
        title={The Devil is in Temporal Token: High Quality Video Reasoning Segmentation},
        author={Gong, Sitong and Zhuge, Yunzhi and Zhang, Lu and Yang, Zongxin and Zhang, Pingping and Lu, Huchuan},
        journal={arXiv preprint arXiv:2501.08549},
        year={2025}
        }

🎖️ Acknowledgement

This work is built upon Chat-UniVi, VISA, and SAM2. We sincerely thank the authors for these excellent contributions.
