HuggingFace | Paper | Reddit
This is the official PyTorch implementation of the paper:
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo
Under submission
- [New!] 2025-12-03: We now also support LIBERO on Huawei NPUs. Please check this branch. Thanks to all the supporters.
This release contains the core logic of our Discrete Diffusion VLA on the LIBERO benchmark. We will release the remaining components after paper acceptance. Thank you.
See SETUP.md for instructions on setting up the conda environment.
See LIBERO.md for fine-tuning/evaluating on LIBERO simulation benchmark task suites.
We also provide LIBERO support on NPUs in the libero_NPU branch. Our method achieves a 98.6% success rate on LIBERO-Object using the Ascend 910B.
Please refer to finetune.sh and finetune_from_ckpt.sh for fine-tuning; a minimal launch example is shown below. We fine-tuned OpenVLA via LoRA (r=32) on four LIBERO task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-10 (also called LIBERO-Long).
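For reference, fine-tuning is launched as a single script invocation. This is a minimal sketch: we assume the LoRA rank, target task suite, and other hyperparameters are configured inside the scripts, so check them before running.

```bash
# Fine-tune from the base OpenVLA checkpoint.
# (Assumption: LoRA rank r=32 and the target LIBERO task suite are set inside finetune.sh.)
bash finetune.sh

# Continue fine-tuning from an existing checkpoint.
bash finetune_from_ckpt.sh
```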
Please refer to scripts/eval_libero_object_batch.sh for evaluation. Notes:
- The evaluation script runs 500 trials by default (10 tasks x 50 episodes each). You can modify the number of trials per task with `--num_trials_per_task` and change the random seed with `--seed`.
- NOTE: Setting `--center_crop True` is important because we fine-tuned OpenVLA with random crop augmentations (we took a random crop covering 90% of the image area for every training sample, so at test time we simply take the center 90% crop).
- The evaluation script logs results locally. You can also log results to Weights & Biases by setting `--use_wandb True` and specifying `--wandb_project <PROJECT>` and `--wandb_entity <ENTITY>`. An example command is given after this list.
- Note that results may vary slightly if you use a different GPU than the A100.
- Please be sure to test your policy with the same device/GPU used to train it! Otherwise, performance may drop substantially. You may be able to avoid the drop by merging the LoRA weights into the base model on the downstream device used for testing (e.g., if you train on an H100, merge on an A100 before testing on an A100). See our script vla-scripts/merge_lora_weights_and_save.py for merging the LoRA adapter into the base model offline; a sketch of the invocation follows the example command below. It is okay if you already merged the LoRA weights into the base OpenVLA model during fine-tuning; you can always redownload the base model and merge again, as long as you still have the LoRA adapter (`merge_lora_weights_and_save.py` will handle this for you).
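A minimal evaluation example with the flags described above is sketched here; treat the checkpoint handling and the way `scripts/eval_libero_object_batch.sh` forwards these flags to the underlying evaluation entry point as assumptions to verify against the script.

```bash
# Sketch only: how scripts/eval_libero_object_batch.sh forwards these flags is an assumption.
# --center_crop True matches the 90% random-crop augmentation used during fine-tuning.
bash scripts/eval_libero_object_batch.sh \
  --center_crop True \
  --num_trials_per_task 50 \
  --seed 7 \
  --use_wandb True \
  --wandb_project <PROJECT> \
  --wandb_entity <ENTITY>
```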
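And a sketch of the offline LoRA merge. The argument names below are hypothetical, so check vla-scripts/merge_lora_weights_and_save.py for the actual interface; `openvla/openvla-7b` is the base OpenVLA checkpoint on Hugging Face.

```bash
# Hypothetical argument names; see vla-scripts/merge_lora_weights_and_save.py for the real ones.
# Run this on the same device/GPU you will use for evaluation.
python vla-scripts/merge_lora_weights_and_save.py \
  --base_checkpoint openvla/openvla-7b \
  --lora_finetuned_checkpoint_dir <PATH_TO_LORA_ADAPTER>
```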
If you find this code useful for your research, please use the following BibTeX entry.
@article{liang2025discrete,
title={Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies},
author={Liang, Zhixuan and Li, Yizhuo and Yang, Tianshuo and Wu, Chengyue and Mao, Sitong and Pei, Liuao and Yang, Xiaokang and Pang, Jiangmiao and Mu, Yao and Luo, Ping},
journal={arXiv preprint arXiv:2508.20072},
year={2025}
}

The implementation is largely based on Moo Jin Kim's openvla-oft repo. We thank the authors for their great work.
