VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

Liu, Xiaoyu, Fu, Chaoyou, Yan, Chi, Wu, Chu, Gao, Haihan, Zhang, Yi-Fan, Dong, Shaoqi, Qian, Cheng, Luo, Bin, Yang, Xiuyong, Li, Guanwu, Cai, Yusheng, Shen, Yunhang, Jiang, Deqiang, Cao, Haoyu, Sun, Xing, Shan, Caifeng, He, Ran

arXiv.org Artificial Intelligence 

Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently, as well as to handle real-time user interruptions dynamically. This hinders seamless human-robot collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel human-robot interaction framework designed for both behavioral concurrency and nearly real-time interruption. The core of our approach is a dual-model architecture in which two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the robot to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a "model-as-controller" paradigm, in which we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid robot demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable robotic assistants.

Achieving this level of seamless multimodal coordination is the defining aspiration for our ideal general-purpose robot. However, the predominant focus of the field has been on improving the success rate of specific, static tasks, often overlooking a critical dimension of autonomy: the ability to engage in continuous, natural, and dynamic collaboration with a human user in complex scenarios (Abbo et al., 2025; Fong et al., 2003).
An ideal robotic assistant should not be a silent executor of commands but a collaborative partner. This encompasses maintaining continuous visual perception, processing auditory inputs, generating verbal responses, and executing physical actions in parallel (e.g., answering "Is the bookshelf tidied up?" while organizing a room), as well as dynamically adapting to new directives that reflect a changing environment (e.g., "Don't clean the bedroom yet; the baby is sleeping."). Such concurrent multitasking and dynamic response is fundamental to enabling natural human-robot collaboration.

1. Please see our demo video at this YouTube link.
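The "model-as-controller" idea described above, in which special tokens emitted by the model act as system-level commands and a standby instance can take over on interruption, can be sketched in a minimal simulation. All names below (the token set, the dispatcher class, the role labels) are illustrative assumptions for exposition, not the paper's actual implementation or API.

```python
# Hypothetical sketch of a "model-as-controller" dispatcher: special tokens
# generated by the VLM are interpreted as direct system commands, and two
# parallel model instances swap Active/Standby roles when an interruption
# token arrives. Token names and class structure are assumptions.

# Assumed mapping from model-emitted special tokens to system commands.
SPECIAL_TOKENS = {
    "<stop>": "emergency_stop",
    "<interrupt>": "swap_roles",
    "<speak>": "start_speech",
}


class DualModelController:
    def __init__(self):
        # Two parallel VLA instances: one active, one standing by.
        self.active = "model_A"
        self.standby = "model_B"
        self.command_log = []

    def handle_token(self, token):
        """Dispatch a model output token; non-special tokens pass through."""
        command = SPECIAL_TOKENS.get(token)
        if command is None:
            return None  # ordinary language/action token, not a control signal
        if command == "swap_roles":
            # The standby model takes over, e.g. to serve a user interruption
            # while the previously active model's task is suspended.
            self.active, self.standby = self.standby, self.active
        self.command_log.append(command)
        return command


ctrl = DualModelController()
ctrl.handle_token("<speak>")      # model starts a verbal response
ctrl.handle_token("<interrupt>")  # user interruption: standby takes over
print(ctrl.active, ctrl.command_log)
```

The key design point this sketch illustrates is the coupling of reasoning and control: the same autoregressive stream that produces language also produces the tokens that reconfigure the system, so no separate rule-based interrupt handler is needed.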
