Hybrid Training for Vision-Language-Action Models
Mazzaglia, Pietro, Sancaktar, Cansu, Peschl, Markus, Dijkman, Daniel
arXiv.org Artificial Intelligence
Using Large Language Models to produce intermediate thoughts, a.k.a. Chain-of-thought (CoT), before providing an answer has been a successful recipe for solving complex language tasks. In robotics, similar embodied CoT strategies, which generate thoughts before actions, have also been shown to improve performance when using Vision-Language-Action models (VLAs). Because these techniques lengthen the model's generated outputs to include the thoughts, inference time increases. Delaying an agent's actions in real-world execution, as in robotic manipulation settings, strongly affects the usability of a method, since tasks require long sequences of actions. However, is the generation of long chains-of-thought a strict prerequisite for achieving performance improvements? In this work, we explore the idea of Hybrid Training (HyT), a framework that enables VLAs to learn from thoughts and benefit from the associated performance gains, while making it possible to leave out CoT generation during inference. Furthermore, by learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to either predict actions directly, generate thoughts, or follow instructions. We evaluate the proposed method in a series of simulated benchmarks and real-world experiments.
Figure 1: Hybrid Training (HyT) of VLAs increases the agent's performance similarly to ECoT, but also maintains the same fast inference as standard VLAs. Performance refers to the ClevrSkills experiments (9 tasks, 3000 demos) in the Experiments section.
Despite recent advances in robotics, truly generalist robot policies have long been elusive. Thanks to the joint efforts of collecting large-scale robot data (O'Neill et al., 2024) and making large Vision-Language Models (VLMs) open-source (Steiner et al., 2024; Tong et al., 2024), we have entered a new era in robotics foundation models.
By fine-tuning VLMs on robotic datasets containing actions, we obtain so-called Vision-Language-Action models (VLAs) (Kim et al., 2024; Brohan et al., 2023b;a): large policy models that are trained end-to-end to take language instructions and raw camera images as inputs, and output low-level robotic actions.
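To make the inference-time trade-off described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the desired output type is selected with a mode tag in the text the model generates; the tags `<act>` and `<think>`, the `Demo` fields, and the helper names are all hypothetical. Only the direct-action and thought-then-action outputs are shown, since this excerpt does not specify how the instruction-following output is formed.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical mode tags, for illustration only. The excerpt does not say
# how HyT conditions the model on the desired output type.
ACT, THINK = "<act>", "<think>"


@dataclass
class Demo:
    instruction: str     # language instruction given to the policy
    thought: str         # intermediate reasoning annotated for this step
    action: List[float]  # low-level action, e.g. end-effector deltas


def format_action(action: List[float]) -> str:
    """Render an action vector as space-separated text."""
    return " ".join(f"{a:.2f}" for a in action)


def build_target(demo: Demo, mode: str) -> str:
    """Return the text the model would be trained to generate for `mode`."""
    action_text = format_action(demo.action)
    if mode == ACT:
        # Direct action prediction: no thoughts are generated.
        return f"{ACT} {action_text}"
    if mode == THINK:
        # Thoughts first, then the action (embodied CoT style).
        return f"{THINK} {demo.thought} {ACT} {action_text}"
    raise ValueError(f"unknown mode: {mode}")


def generated_length(demo: Demo, mode: str) -> int:
    """Crude proxy for decoding cost: whitespace-separated token count."""
    return len(build_target(demo, mode).split())


demo = Demo(
    instruction="put the red cube on the plate",
    thought="the red cube is left of the plate; move left and grasp it",
    action=[0.10, -0.20, 0.05],
)
for mode in (ACT, THINK):
    print(mode, generated_length(demo, mode), build_target(demo, mode))
```

Under these assumptions, a single demonstration yields one training target per mode, and the direct-action target is always shorter than the thought-then-action target. That difference in generated length is the source of the inference-time gap the abstract refers to.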
Oct-2-2025