| Project Page | Paper |
This is a version of AC3D built on CogVideoX. AC3D is a camera-controlled video generation pipeline that follows the Plücker-conditioned ControlNet architecture originally introduced in VD3D.
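The camera condition in this family of methods is a per-pixel Plücker embedding of the rays cast by each frame's camera. The sketch below shows one common way to compute it, assuming a 3x3 intrinsic matrix and a 3x4 camera-to-world pose; the function name, argument layout, and channel order (direction first, moment second) are illustrative and may differ from this repo's actual code.

```python
import torch

def plucker_embedding(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Minimal sketch: per-pixel Plucker ray embedding (H, W, 6).

    K   : (3, 3) pixel-space intrinsics.
    c2w : (3, 4) camera-to-world extrinsics [R | t].
    """
    device = K.device
    j, i = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    # Back-project pixel centers to camera-space ray directions.
    dirs_cam = torch.stack(
        [(i + 0.5 - K[0, 2]) / K[0, 0], (j + 0.5 - K[1, 2]) / K[1, 1], torch.ones_like(i)],
        dim=-1,
    )  # (H, W, 3)
    # Rotate into world space and normalize.
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    # All rays share the camera center as their origin.
    origins = c2w[:3, 3].expand_as(dirs)
    moments = torch.cross(origins, dirs, dim=-1)  # o x d
    return torch.cat([dirs, moments], dim=-1)
```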
Install PyTorch first (we used PyTorch 2.4.0 with CUDA 12.4), then install the remaining dependencies:
pip install -r requirements.txt
Prepare the RealEstate10K dataset following the instructions in CameraCtrl. The dataset path is used as video_root_dir in the training and inference scripts. After pre-processing, the folder structure should look as follows (see the pose-file parser sketch after the tree):
- RealEstate10k
  - annotations
    - test.json
    - train.json
  - pose_files
    - 0000cc6d8b108390.txt
    - 00028da87cc5a4c4.txt
    - ...
  - video_clips
    - 0000cc6d8b108390.mp4
    - 00028da87cc5a4c4.mp4
    - ...
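For reference, here is a minimal parser sketch for the pose files above, assuming the standard RealEstate10K camera file layout: line 1 holds the source video URL, and each following line holds a timestamp in microseconds, four intrinsics normalized by image size (fx fy cx cy), two zeros, and a row-major 3x4 world-to-camera matrix (19 numbers in total). `load_poses` is a hypothetical helper, not part of this repo.

```python
import numpy as np

def load_poses(path: str):
    # Hypothetical sketch for the standard RealEstate10K .txt layout.
    with open(path) as f:
        url = f.readline().strip()            # line 1: source video URL
        rows = [line.split() for line in f if line.strip()]
    timestamps = np.array([int(r[0]) for r in rows])                     # microseconds
    intrinsics = np.array([[float(x) for x in r[1:5]] for r in rows])    # fx fy cx cy (normalized)
    w2c = np.array([[float(x) for x in r[7:19]] for r in rows]).reshape(-1, 3, 4)
    return url, timestamps, intrinsics, w2c
```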
AC3D: CogVideoX-2B: Checkpoint
AC3D: CogVideoX-5B: Checkpoint
AC3D: CogVideoX-2B
bash scripts/inference_2b.sh
AC3D: CogVideoX-5B
bash scripts/inference_5b.sh
The 2B model requires 48 GB of GPU memory and the 5B model requires 80 GB. Training on one node with 8xA100 80 GB GPUs should take around 1-2 days to converge.
AC3D: CogVideoX-2B
bash scripts/train_2b.sh
AC3D: CogVideoX-5B
bash scripts/train_5b.sh
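In a ControlNet-style setup, the pretrained base transformer stays frozen and only the camera-conditioned branch is trained. A minimal sketch of that split is shown below; `build_optimizer`, the module names, and the hyperparameters are all illustrative, not this repo's exact training code.

```python
import torch

def build_optimizer(base: torch.nn.Module, controlnet: torch.nn.Module,
                    lr: float = 1e-4) -> torch.optim.AdamW:
    # Freeze the pretrained CogVideoX transformer; only the
    # Plucker-conditioned ControlNet branch receives gradients.
    # (Hypothetical sketch under assumed module names.)
    base.requires_grad_(False)
    controlnet.requires_grad_(True)
    return torch.optim.AdamW(controlnet.parameters(), lr=lr)
```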
- This code mainly builds upon CogVideoX-ControlNet
- This code uses the original CogVideoX model
- The data processing and data loading pipeline builds upon CameraCtrl
@article{bahmani2024ac3d,
  author  = {Bahmani, Sherwin and Skorokhodov, Ivan and Qian, Guocheng and Siarohin, Aliaksandr and Menapace, Willi and Tagliasacchi, Andrea and Lindell, David B. and Tulyakov, Sergey},
  title   = {AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers},
  journal = {arXiv preprint arXiv:2411.18673},
  year    = {2024},
}