dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought
Wen, Junjie; Zhu, Minjie; Liu, Jiaming; Liu, Zhiyuan; Yang, Yicun; Zhang, Linfeng; Zhang, Shanghang; Zhu, Yichen; Xu, Yi
arXiv.org Artificial Intelligence
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic control in a single system. For practical deployment, we mitigate inference latency by incorporating two acceleration strategies, a prefix attention mask and key-value (KV) caching, yielding up to a 2× speedup at test-time inference. We evaluate dVLA in both simulation and the real world: on the LIBERO benchmark it achieves state-of-the-art performance with a 96.4% average success rate, consistently surpassing both discrete and continuous action policies; on a real Franka robot, it succeeds across a diverse task suite, including a challenging bin-picking task that requires multi-step planning, demonstrating robust real-world performance. Together, these results underscore the promise of unified diffusion frameworks for practical, high-performance VLA robotics.

The development of VLA models has undergone two stages of evolution. In the first stage, a pre-trained vision-language backbone is used purely as a feature extractor, and the extracted features are mapped directly to robot actions. As these vanilla VLA architectures proved inadequate for open-world instruction following and long-horizon tasks, a second-stage training paradigm emerged that co-trains on image-text data alongside action trajectories, both to preserve knowledge from the pre-trained VLM and, when necessary, to predict sub-step reasoning together with robot actions (Zhou et al., 2025b;a; Intelligence et al., 2025b; Driess et al., 2025).
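The abstract names a prefix attention mask and KV caching but does not spell out how the mask is built. The sketch below shows one common construction, offered purely as an illustration and not as the authors' implementation: it assumes the first `prefix_len` tokens (for example, the image and instruction tokens) attend only to each other, while all later tokens attend to the full sequence. Under that assumption the prefix keys and values do not depend on the tokens being denoised, so they could be computed once and reused across denoising steps. The function name and parameters are hypothetical.

```python
def prefix_attention_mask(prefix_len: int, total_len: int) -> list[list[bool]]:
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Assumption (not taken from the paper): prefix queries see only prefix keys,
    and all remaining queries see every key.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")

    mask = []
    for q in range(total_len):
        if q < prefix_len:
            # Prefix rows ignore the non-prefix tokens, so the prefix
            # representations (and their keys/values) stay fixed and cacheable.
            row = [k < prefix_len for k in range(total_len)]
        else:
            # Non-prefix rows attend bidirectionally to the whole sequence.
            row = [True] * total_len
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print a small mask with 2 prefix tokens and 4 tokens in total.
    for row in prefix_attention_mask(prefix_len=2, total_len=4):
        print("".join("1" if allowed else "0" for allowed in row))
```

With a mask of this shape, only the rows for non-prefix tokens need to be recomputed at each step, which is the kind of saving KV caching is meant to exploit.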
Oct-1-2025