MoS-VLA: A Vision-Language-Action Model with One-Shot Skill Adaptation

Ruihan Zhao, Tyler Ingebrand, Sandeep Chinchali, Ufuk Topcu

arXiv.org Artificial Intelligence 

Vision-Language-Action (VLA) models trained on large robot datasets promise general-purpose, robust control across diverse domains and embodiments. However, existing approaches often fail out-of-the-box when deployed in novel environments, embodiments, or tasks. We introduce Mixture of Skills VLA (MoS-VLA), a framework that represents robot manipulation policies as linear combinations of a finite set of learned basis functions. During pretraining, MoS-VLA jointly learns these basis functions across datasets from the Open X-Embodiment project, producing a structured skill space. At test time, adapting to a new task requires only a single expert demonstration. The corresponding skill representation is then inferred via a lightweight convex optimization problem that minimizes the L1 action error, without requiring gradient updates. Empirically, MoS-VLA achieves lower action-prediction error on five out of five unseen datasets and succeeds in both simulation and real-robot tasks where a pretrained VLA model fails outright.

Inspired by the success of large language models, modern robotics aims to achieve generalization and human-like performance through the use of internet-scale data and large, attention-based architectures. To this end, researchers have collected enormous datasets of robotic arm trajectories (Open X-Embodiment Collaboration et al., 2023) and trained so-called vision-language-action foundation models to map natural language task descriptions and state observations to robot actions (Kim et al., 2024; Octo Model Team et al., 2024; Brohan et al., 2023b;a; Ma et al., 2024).
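
The one-shot adaptation step described in the abstract, where a policy is a linear combination of basis functions and the coefficients are chosen to minimize the L1 action error on one demonstration, can be illustrated with a small sketch. Everything here is an assumption made for illustration and not the authors' implementation: the names `fit_skill_coefficients` and `predict_action`, the array shapes, the synthetic data standing in for basis-function outputs, and the solver itself. The sketch uses iteratively reweighted least squares (IRLS) as a simple stand-in that approximately solves the convex L1 problem using only weighted least-squares solves; a linear-programming solver would solve the same problem exactly.

```python
import numpy as np


def fit_skill_coefficients(basis_actions, demo_actions, n_iters=50, eps=1e-8):
    """Approximately solve min_c sum |demo_actions - basis_actions @ c| via IRLS.

    basis_actions: (T, m, k) array, outputs of k basis functions at T demo steps
                   (hypothetical layout, assumed for this sketch).
    demo_actions:  (T, m) array, the expert action at each step.
    Returns the coefficient vector c with shape (k,).
    """
    T, m, k = basis_actions.shape
    A = basis_actions.reshape(T * m, k)  # stack all action dimensions as rows
    b = demo_actions.reshape(T * m)
    # Start from the ordinary least-squares fit.
    c = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(n_iters):
        residual = A @ c - b
        # Weighting each squared residual by 1/|r| mimics an absolute-value loss;
        # eps caps the weight when a residual is (near) zero.
        sqrt_w = 1.0 / np.sqrt(np.maximum(np.abs(residual), eps))
        c = np.linalg.lstsq(A * sqrt_w[:, None], b * sqrt_w, rcond=None)[0]
    return c


def predict_action(basis_action, c):
    """Combine the k basis outputs (m, k) for one observation into one action (m,)."""
    return basis_action @ c


# Synthetic stand-in for a single expert demonstration (not real robot data).
rng = np.random.default_rng(0)
T, m, k = 40, 3, 4
basis_actions = rng.normal(size=(T, m, k))
c_true = np.array([0.5, -1.0, 2.0, 0.25])
demo_actions = basis_actions @ c_true
demo_actions[::10, 0] += 5.0  # corrupt a few action entries

c_hat = fit_skill_coefficients(basis_actions, demo_actions)
print("estimated coefficients:", np.round(c_hat, 3))
```

Because the basis functions stay frozen, only the `k` coefficients are estimated, which is why no gradient updates to the network are needed in this picture. The L1 objective is also less sensitive than least squares to a few badly mismatched action entries, as the corrupted entries in the synthetic demonstration are meant to illustrate.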