HuggingFace | Paper | Reddit
This is the official PyTorch implementation of the paper:
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo
Under submission
- [New!] 2025-12-03: We now also support LIBERO on Huawei NPUs. Please check this branch. Thanks to all the supporters.
This release contains the core logic of our Discrete Diffusion VLA on the LIBERO benchmark. We will release the remaining components after paper acceptance. Thank you.
See SETUP.md for instructions on setting up the conda environment.
See LIBERO.md for fine-tuning/evaluating on LIBERO simulation benchmark task suites.
We also provide LIBERO support on NPUs in the libero_NPU branch. Our method achieves a 98.6% success rate on LIBERO-Object using the Ascend 910B.
Please refer to finetune.sh and finetune_from_ckpt.sh for fine-tuning; a minimal launch example is shown below. We fine-tuned OpenVLA via LoRA (r=32) on four LIBERO task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-10 (also called LIBERO-Long).
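For reference, fine-tuning is launched as a single script invocation. This is a minimal sketch: we assume the LoRA rank, target task suite, and other hyperparameters are configured inside the scripts, so check them before running.

```bash
# Fine-tune from the base OpenVLA checkpoint.
# (Assumption: LoRA rank r=32 and the target LIBERO task suite are set inside finetune.sh.)
bash finetune.sh

# Continue fine-tuning from an existing checkpoint.
bash finetune_from_ckpt.sh
```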
Please refer to scripts/eval_libero_object_batch.sh for evaluation. Notes:
- The evaluation script runs 500 trials by default (10 tasks x 50 episodes each). You can modify the number of trials per task with `--num_trials_per_task` and change the random seed with `--seed`.
- NOTE: Setting `--center_crop True` is important because we fine-tuned OpenVLA with random crop augmentations (we took a random crop covering 90% of the image area for every training sample, so at test time we simply take the center 90% crop).
- The evaluation script logs results locally. You can also log results to Weights & Biases by setting `--use_wandb True` and specifying `--wandb_project <PROJECT>` and `--wandb_entity <ENTITY>`. An example command is given after this list.
- Note that results may vary slightly if you use a different GPU than the A100.
- Please be sure to test your policy with the same device/GPU used to train it! Otherwise, performance may drop substantially. You may be able to avoid the drop by merging the LoRA weights into the base model on the downstream device used for testing (e.g., if you train on an H100, merge on an A100 before testing on an A100). See our script vla-scripts/merge_lora_weights_and_save.py for merging the LoRA adapter into the base model offline; a sketch of the invocation follows the example command below. It is okay if you already merged the LoRA weights into the base OpenVLA model during fine-tuning; you can always redownload the base model and merge again, as long as you still have the LoRA adapter (`merge_lora_weights_and_save.py` will handle this for you).
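A minimal evaluation example with the flags described above is sketched here; treat the checkpoint handling and the way `scripts/eval_libero_object_batch.sh` forwards these flags to the underlying evaluation entry point as assumptions to verify against the script.

```bash
# Sketch only: how scripts/eval_libero_object_batch.sh forwards these flags is an assumption.
# --center_crop True matches the 90% random-crop augmentation used during fine-tuning.
bash scripts/eval_libero_object_batch.sh \
  --center_crop True \
  --num_trials_per_task 50 \
  --seed 7 \
  --use_wandb True \
  --wandb_project <PROJECT> \
  --wandb_entity <ENTITY>
```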
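And a sketch of the offline LoRA merge. The argument names below are hypothetical, so check vla-scripts/merge_lora_weights_and_save.py for the actual interface; `openvla/openvla-7b` is the base OpenVLA checkpoint on Hugging Face.

```bash
# Hypothetical argument names; see vla-scripts/merge_lora_weights_and_save.py for the real ones.
# Run this on the same device/GPU you will use for evaluation.
python vla-scripts/merge_lora_weights_and_save.py \
  --base_checkpoint openvla/openvla-7b \
  --lora_finetuned_checkpoint_dir <PATH_TO_LORA_ADAPTER>
```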
If you find this code useful for your research, please use the following BibTeX entry.
@article{liang2025discrete,
title={Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies},
author={Liang, Zhixuan and Li, Yizhuo and Yang, Tianshuo and Wu, Chengyue and Mao, Sitong and Pei, Liuao and Yang, Xiaokang and Pang, Jiangmiao and Mu, Yao and Luo, Ping},
journal={arXiv preprint arXiv:2508.20072},
year={2025}
}

The implementation is largely based on Moo Jin Kim's openvla-oft repo. We thank the authors for their great work.
