Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists [Weights and Dataset]
- 📄 Overview
- 📜 Abstract
- ✨ Key Features
- 🏗️ Dataset Construction
- ✂️ Editing Tasks
- 📚 Paper Content
- 🛠️ How to Use
- 👥 Authors
- 🔖 Citation
- 📧 Contact
Señorita-2M is a comprehensive, high-quality dataset for general instruction-based video editing. It consists of approximately 2 million video editing pairs, each comprising a source video, an edited result, and a detailed editing instruction, produced by a set of specialized video editing experts.
Recent advancements in video generation have spurred the development of video editing techniques, which can be divided into inversion-based and end-to-end methods. However, current video editing methods still suffer from several challenges. Inversion-based methods, though training-free and flexible, are time-consuming during inference, struggle with fine-grained editing instructions, and produce artifacts and jitter. On the other hand, end-to-end methods, which rely on edited video pairs for training, offer faster inference speeds but often produce poor editing results due to a lack of high-quality training video pairs. In this paper, to close the gap in end-to-end methods, we introduce Señorita-2M, a high-quality video editing dataset. Señorita-2M consists of approximately 2 million video editing pairs. It is built by crafting four high-quality, specialized video editing models, each trained by our team to achieve state-of-the-art editing results. We also propose a filtering pipeline to eliminate poorly edited video pairs. Furthermore, we explore common video editing architectures to identify the most effective structure based on current pre-trained generative models. Extensive experiments show that our dataset helps yield remarkably high-quality video editing results.
- High-Quality Annotations: Each source video is paired with edited videos and editing instructions produced by our trained video editing experts.
- Diverse Editing Tasks: The dataset covers a wide range of video editing tasks, 18 in total, including object removal, object swap, object addition, and global and local stylization.
- Large Scale: With over 2 million video pairs, Señorita-2M is one of the largest video editing datasets available.
We built the dataset by leveraging high-quality video editing experts. Specifically, we trained four high-quality video editing experts using CogVideoX: a global stylizer, a local stylizer, an inpainting expert, and a super-resolution expert.
Furthermore, we trained multiple video editors based on different video editing architectures using this dataset to evaluate the effectiveness of various editing frameworks, ultimately achieving impressive results.
Our dataset consists of 18 editing tasks. Five of these are produced by our trained editing experts, while the remaining tasks are handled by computer vision models. The expert-edited portion accounts for around 70% of the total dataset.
The dataset construction pipeline involves several stages, including data collection, annotation, and quality verification. We crawled videos from Pexels, a video-sharing website offering high-resolution, high-quality footage, via its authenticated API, collecting around 390,000 videos. Each video clip is then carefully annotated: captions are generated with BLIP-2 and kept short to respect CLIP's token-length limit, while mask regions and their corresponding phrases are obtained with CogVLM2 and Grounded-SAM2.
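As a minimal sketch of the captioning step, the snippet below captions the first frame of a clip with BLIP-2 and keeps the output short enough to fit within CLIP's token limit; the checkpoint name and the choice of key frame are illustrative assumptions rather than the exact settings used to build the dataset.

```python
# Minimal captioning sketch: caption a clip's first frame with BLIP-2.
# Checkpoint and key-frame choice are illustrative assumptions.
import cv2
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to(device)

def caption_clip(video_path: str) -> str:
    """Read the first frame of a clip and produce a short BLIP-2 caption."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"Could not read a frame from {video_path}")
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
    # Keep the caption short so it stays within CLIP's 77-token limit.
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```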
Global stylization applies a consistent style across the entire video. This task is performed by the global stylizer, trained on CogVideoX, which ensures a uniform look and feel throughout the video. Its video ControlNet uses multiple control conditions, Canny, HED, and depth, each transformed into latent space via the 3D-VAE, to produce robust style-transfer results.
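As a minimal sketch of how one control condition can be prepared, the snippet below extracts per-frame Canny edge maps with OpenCV; HED and depth maps would be produced analogously with their own detectors, and the resulting control frames are then encoded by the 3D-VAE before being fed to the ControlNet branch. The thresholds here are illustrative, not the values used in training.

```python
# Sketch: build a per-frame Canny control video for the ControlNet branch.
import cv2
import numpy as np

def canny_control_frames(video_path: str, low: int = 100, high: int = 200) -> np.ndarray:
    """Return a (T, H, W, 3) uint8 array of Canny edge maps, one per frame."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        edges = cv2.Canny(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), low, high)
        frames.append(np.repeat(edges[..., None], 3, axis=-1))  # 3 channels for the 3D-VAE
    cap.release()
    return np.stack(frames)
```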
Local stylization focuses on specific regions within the video, allowing for more detailed and localized effects. Inspired by inpainting methods such as AVID, we trained a local stylizer that combines inpainting with ControlNet conditioning. The model uses the same three control conditions as the global stylizer, fed into the ControlNet branch, while the mask conditions are fed into the main branch. The pretrained backbone is CogVideoX-2B.
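To make the conditioning concrete, the sketch below shows one common inpainting-style layout for the main-branch input: the noisy latents are concatenated channel-wise with the 3D-VAE encoding of the masked video and a mask resized to the latent grid. The exact channel arrangement used by our local stylizer may differ; treat this as an illustration, not the implementation.

```python
# Illustrative inpainting-style conditioning for the main branch (layout is an assumption).
import torch
import torch.nn.functional as F

def build_main_branch_input(noisy_latents, masked_video_latents, mask):
    """
    noisy_latents:        (B, C, T_lat, h, w) latent video being denoised
    masked_video_latents: (B, C, T_lat, h, w) 3D-VAE encoding of the video with the edit region blanked
    mask:                 (B, 1, T, H, W) binary pixel-space mask (1 = region to restylize)
    """
    t_lat, h, w = noisy_latents.shape[-3:]
    # Resize the pixel-space mask to the latent grid (the 3D-VAE also compresses time).
    mask_lat = F.interpolate(mask.float(), size=(t_lat, h, w), mode="nearest")
    # Channel-wise concat; the first conv of the main branch must accept 2*C + 1 input channels.
    return torch.cat([noisy_latents, masked_video_latents, mask_lat], dim=1)
```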
Object removal is a common video editing task in which unwanted objects are seamlessly removed from the video. Our inpainting expert is trained to handle this task efficiently, ensuring that the background is accurately reconstructed. Existing video inpainters such as ProPainter produce blur when removing objects, which greatly reduces their usability. We therefore trained a powerful video remover based on CogVideoX-2B, using a novel mask selection strategy.
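The mask selection strategy itself is described in the paper; purely to illustrate the general idea of varying mask shapes so the remover cannot rely on the mask silhouette, the sketch below randomly swaps a tight segmentation mask for a dilated or bounding-box variant. It is not necessarily the strategy we used.

```python
# Illustrative mask variation for removal training; NOT necessarily the paper's strategy.
import cv2
import numpy as np

def sample_training_mask(seg_mask: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """seg_mask: (H, W) binary mask of the object to remove."""
    choice = rng.choice(["precise", "dilated", "bbox"])
    if choice == "precise":
        return seg_mask
    if choice == "dilated":
        kernel = np.ones((15, 15), np.uint8)  # dilation size is an arbitrary example value
        return cv2.dilate(seg_mask.astype(np.uint8), kernel)
    # Bounding-box mask hides the object's exact silhouette entirely.
    ys, xs = np.nonzero(seg_mask)
    bbox_mask = np.zeros_like(seg_mask, dtype=np.uint8)
    if len(ys) > 0:
        bbox_mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
    return bbox_mask
```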
Object swap replaces one object with another within the video. This complex task is handled by our trained video editing experts, which ensure that the new object blends seamlessly with the surrounding environment. The pipeline combines FLUX-Fill with our trained inpainter: LLaMA-3 first suggests a replacement object, FLUX-Fill swaps it into the first frame, and the inpainter generates the remaining frames guided by that edited first frame.
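The sketch below outlines this three-step pipeline, assuming a recent diffusers release that provides FluxFillPipeline. The checkpoint name, the example prompt, and the video_inpainter callable (standing in for our first-frame-guided inpainter) are placeholders, not the exact components used to build the dataset.

```python
# Sketch of the object-swap pipeline: LLM-suggested replacement -> FLUX-Fill on frame 1
# -> first-frame-guided video inpainting for the remaining frames.
import torch
from diffusers import FluxFillPipeline

def swap_object(first_frame, first_frame_mask, replacement_prompt,
                video_frames, video_masks, video_inpainter):
    """
    first_frame / first_frame_mask: PIL images of the first frame and its edit-region mask.
    replacement_prompt: object suggested by an LLM such as LLaMA-3 (e.g. "a golden retriever").
    video_inpainter: callable standing in for the trained first-frame-guided inpainter.
    """
    # Inpaint the replacement object into the first frame with FLUX-Fill.
    pipe = FluxFillPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
    ).to("cuda")
    edited_first_frame = pipe(
        prompt=replacement_prompt,
        image=first_frame,
        mask_image=first_frame_mask,
        num_inference_steps=30,
    ).images[0]
    # Propagate the edit through the rest of the clip, guided by the edited first frame.
    return video_inpainter(edited_first_frame, video_frames, video_masks)
```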
First, install the required dependencies:

```bash
pip3 install -r requirements.txt
```

To download the pretrained models and the dataset, run the following commands:

```bash
huggingface-cli download --resume-download THUDM/CogVideoX-5b-I2V --local-dir ./cogvideox-5b-i2v
huggingface-cli download --resume-download PengWeixuanSZU/Senorita-2M --local-dir ./
```

To run the demo application:

```bash
CUDA_VISIBLE_DEVICES=0 python3 app.py
```
- Bojia Zi*, The Chinese University of Hong Kong
- Penghui Ruan*, The Hong Kong Polytechnic University
- Marco Chen, Tsinghua University
- Xianbiao Qi†, IntelliFusion Inc.
- Shaozhe Hao, The University of Hong Kong
- Shihao Zhao, The University of Hong Kong
- Youze Huang, University of Electronic Science and Technology of China
- Bin Liang, The Chinese University of Hong Kong
- Rong Xiao, IntelliFusion Inc.
- Kam-Fai Wong, The Chinese University of Hong Kong
Note: * indicates equal contribution. † indicates the corresponding author. The demo code was developed by Weixuan Peng.
If you use Señorita-2M in your research, please cite our work as follows:
```bibtex
@article{zi2025senorita,
  title={Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists},
  author={Bojia Zi and Penghui Ruan and Marco Chen and Xianbiao Qi and Shaozhe Hao and Shihao Zhao and Youze Huang and Bin Liang and Rong Xiao and Kam-Fai Wong},
  journal={arXiv preprint arXiv:2502.06734},
  year={2025}
}
```
For more information or any queries regarding the dataset, please contact us at [email protected].