Haoyu Wu
(† Corresponding Author)
We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency.
In the Multi-Agent Condition Module (Sec. 3.2), Agent Identity Embedding and Adaptive Action Weighting are employed to achieve multi-agent controllability. In the Global State Encoder (Sec. 3.3), we use a frozen VGGT backbone to extract implicit 3D global environmental information from partial observations, thereby improving multi-view consistency. MultiWorld scales effectively across varying agent counts and camera views, supporting autoregressive inference to generate beyond the training context length (Sec. 3.4).
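The two Sec. 3.2 components can be sketched roughly as follows. This is an illustrative NumPy sketch, not the released implementation: all shapes, parameter names, and the gating form are assumptions for exposition; see the paper and code for the actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions and randomly initialised parameters (illustration only).
num_agents, action_dim, hidden = 2, 8, 32
identity = rng.normal(size=(num_agents, hidden))       # Agent Identity Embedding
W_act = rng.normal(size=(action_dim, hidden)) * 0.1    # action projection
w_gate = rng.normal(size=(hidden, 1)) * 0.1            # Adaptive Action Weighting

def condition(actions):
    # actions: (batch, num_agents, action_dim)
    feats = actions @ W_act + identity                  # tag each agent's action with its identity
    gates = sigmoid(feats @ w_gate)                     # per-agent scalar weight in (0, 1)
    return (gates * feats).sum(axis=1)                  # (batch, hidden) conditioning vector

out = condition(rng.normal(size=(4, num_agents, action_dim)))
print(out.shape)  # (4, 32)
```

The identity embedding disambiguates which agent each action belongs to; the learned gate lets the model down-weight uninformative actions when fusing them into a single conditioning signal.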
- [2026/5/11] Training code and long-term (autoregressive) inference code are available.
- [2026/4/21] Paper, code, data, and project page are available. Welcome to try it out!
```bash
conda create -n multiworld python=3.13
conda activate multiworld

# install torch
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
  --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
```

The MultiWorld release contains two parts: It Takes Two game videos and robotics videos.
All .tar archives are stored flat in the same dataset repository.
From ModelScope:

```bash
modelscope login <YOUR_API_KEY>
modelscope download --dataset HaoyuWuRUC/MultiWorldData \
  --local_dir ./data
bash preprocess/untar_chunks.sh
```

From Hugging Face:

```bash
hf auth login
hf download Haoyuwu/MultiWorldData --repo-type dataset \
  --local-dir ./data
bash preprocess/untar_chunks.sh
```

After running `preprocess/untar_chunks.sh`, the archives are extracted to:
- `data/ittakestwo_release/`: the It Takes Two dataset
- `data/robots_release/`: the robotics dataset
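The expected layout can be sanity-checked with a few lines of Python. This is a small convenience sketch based only on the directory names listed above; the `check_layout` helper is hypothetical, not part of the release.

```python
from pathlib import Path

# Expected layout after preprocess/untar_chunks.sh (directory names from this README).
DATA_ROOT = Path("data")
EXPECTED = ["ittakestwo_release", "robots_release"]

def check_layout(root: Path = DATA_ROOT) -> list[str]:
    """Return the expected dataset directories that are missing under root."""
    return [name for name in EXPECTED if not (root / name).is_dir()]

missing = check_layout()
if missing:
    print("Missing:", ", ".join(missing), "- re-run preprocess/untar_chunks.sh")
else:
    print("Dataset layout looks good.")
```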
From ModelScope:

```bash
modelscope login <YOUR_API_KEY>
modelscope download --model HaoyuWuRUC/MultiWorldCheckpoint \
  multiworld_480p_fulldata.safetensors --local_dir ./checkpoints
modelscope download --model HaoyuWuRUC/MultiWorldCheckpoint \
  multiworld_480p_toydata.safetensors --local_dir ./checkpoints
modelscope download --model HaoyuWuRUC/MultiWorldCheckpoint \
  multiworld_320p_robots.safetensors --local_dir ./checkpoints
```

From Hugging Face:

```bash
hf auth login
hf download Haoyuwu/MultiWorldCheckpoint multiworld_480p_fulldata.safetensors --local-dir ./checkpoints --repo-type model
hf download Haoyuwu/MultiWorldCheckpoint multiworld_480p_toydata.safetensors --local-dir ./checkpoints --repo-type model
hf download Haoyuwu/MultiWorldCheckpoint multiworld_320p_robots.safetensors --local-dir ./checkpoints --repo-type model
```

To train MultiWorld on the It Takes Two dataset:
```bash
bash ittakestwo/scripts/train.sh ittakestwo/configs/train_ua_480P_toy.yaml train_480P
```

- `config_path`: path to the training config (e.g., `train_ua_480P_toy.yaml` for the 480p two-agent setting).
- `output_path`: experiment outputs (checkpoints, logs, eval videos) are saved to `outputs/<EXP_NAME>/`.
- The script uses `accelerate launch` for distributed training. Adjust `nproc_per_node` in your `accelerate` config as needed.
For non-autoregressive inference, the left/right view videos are automatically concatenated side-by-side after generation; the `gen/` and `gt/` directories already contain the final concatenated outputs when inference finishes.
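The side-by-side stitching itself is straightforward. A minimal NumPy sketch, if you ever need to concatenate per-view frame arrays yourself (the frame counts and the 832x480 resolution below are illustrative, not taken from the configs):

```python
import numpy as np

def concat_views(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Stitch two per-view videos side-by-side.

    left, right: (frames, height, width, channels) arrays of identical shape.
    Returns an array of shape (frames, height, 2 * width, channels).
    """
    assert left.shape == right.shape, "views must share length and resolution"
    return np.concatenate([left, right], axis=2)  # join along the width axis

# Illustrative dummy videos: 16 frames at 832x480, left all zeros, right all ones.
left = np.zeros((16, 480, 832, 3), dtype=np.uint8)
right = np.ones((16, 480, 832, 3), dtype=np.uint8)
video = concat_views(left, right)
print(video.shape)  # (16, 480, 1664, 3)
```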
Inference with the checkpoint trained on the full dataset:

```bash
python -m torch.distributed.run --nproc_per_node=8 \
  ittakestwo/parallel_inference.py \
  --inference-seed 0 \
  --num-inference-steps 50 \
  --config-path ittakestwo/configs/inference_480P_full.yaml \
  --model-path checkpoints/multiworld_480p_fulldata.safetensors \
  --output-dir outputs/eval_480P_full
```

Inference with the checkpoint trained on the toy dataset:
```bash
python -m torch.distributed.run --nproc_per_node=8 \
  ittakestwo/parallel_inference.py \
  --inference-seed 0 \
  --num-inference-steps 35 \
  --config-path ittakestwo/configs/inference_480P_toy.yaml \
  --model-path checkpoints/multiworld_480p_toydata.safetensors \
  --output-dir outputs/eval_480P_toy
```

Inference on the long video-action dataset (autoregressive):
```bash
python -m torch.distributed.run --nproc_per_node=8 \
  ittakestwo/parallel_inference.py \
  --inference-mode autoregressive \
  --num-chunks 3 \
  --config-path ittakestwo/configs/inference_480P_full_long.yaml \
  --model-path checkpoints/multiworld_480p_fulldata.safetensors \
  --output-dir outputs/autoregressive_longvideo
```

Inference on the robotics dataset:
```bash
python -m torch.distributed.run --nproc_per_node=8 \
  robots/parallel_inference.py \
  --config-path robots/configs/inference.yaml \
  --model-path checkpoints/multiworld_320p_robots.safetensors \
  --output-dir outputs/test_robotics_output
```

This codebase is built on top of the open-source implementations of DiffSynth-Studio, VGGT, RoboFactory, and the Wan2.2 repo.
We welcome discussion of this project and of video world models in general. You can reach me at wuhaoyu556@connect.hku.hk.
This repository contains both software code and data/assets under separate licenses:
- Code (training scripts, models, inference tools): Licensed under Apache-2.0. Free for research and commercial use.
- Data & Model Weights (datasets, checkpoints, pre-trained weights): Licensed under CC BY-NC 4.0.
- ✅ Allowed: Academic research, education, non-commercial projects
- ❌ Prohibited: Commercial products
- 📧 Commercial licensing: Contact wuhaoyu556@connect.hku.hk
If you find our work useful for your research, please consider citing our paper:
```bibtex
@article{wu2025multiworld,
  title={MultiWorld: Scalable Multi-Agent Multi-View Video World Models},
  author={Wu, Haoyu and Yu, Jiwen and Zou, Yingtian and Liu, Xihui},
  journal={arXiv preprint arXiv:2604.18564},
  year={2026}
}
```
