VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
Xiaoyu Liu, Chaoyou Fu, Chi Yan, Chu Wu, Haihan Gao, Yi-Fan Zhang, Shaoqi Dong, Cheng Qian, Bin Luo, Xiuyong Yang, Guanwu Li, Yusheng Cai, Yunhang Shen, Deqiang Jiang, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He
arXiv.org Artificial Intelligence
Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm: they cannot see, hear, speak, and act concurrently, nor handle real-time user interruptions dynamically. This hinders seamless human-robot collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel human-robot interaction framework designed for both behavioral concurrency and near-real-time interruption. The core of our approach is a dual-model architecture in which two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the robot to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking. We further propose a "model-as-controller" paradigm: the VLM is fine-tuned to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments on a physical humanoid robot demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving a very high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable robotic assistants.

Achieving this level of seamless multimodal coordination is the defining aspiration for our ideal general-purpose robot. However, the field has predominantly focused on improving success rates on specific, static tasks, often overlooking a critical dimension of autonomy: the ability to engage in continuous, natural, and dynamic collaboration with a human user in complex scenarios (Abbo et al., 2025; Fong et al., 2003).
An ideal robotic assistant should not be a silent executor of commands but a collaborative partner. This entails maintaining continuous visual perception, processing auditory inputs, generating verbal responses, and executing physical actions in parallel (e.g., answering "Is the bookshelf tidied up?" while organizing a room), and dynamically adapting to new directives as the environment changes (e.g., "Don't clean the bedroom yet--the baby is sleeping."). Such concurrent multitasking and dynamic response are fundamental to natural human-robot collaboration.

Please see our demo video at this YouTube link.
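To make the "model-as-controller" idea above concrete, here is a minimal sketch of how special output tokens could act as direct system-level commands, with an interrupt token swapping the Active and Standby model instances. The token names and the `DualModelController` class are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical special tokens the fine-tuned VLM might emit (assumed names,
# not taken from the paper).
INTERRUPT = "<interrupt>"   # user speech preempts the current task
STOP = "<stop>"             # emergency stop

class DualModelController:
    """Toy controller: scans the VLM token stream for system commands."""

    def __init__(self):
        self.active, self.standby = "model_A", "model_B"
        self.halted = False

    def handle_token(self, token: str) -> str:
        """Treat special tokens as system-level commands; return the active model."""
        if token == STOP:
            self.halted = True          # freeze all actuation
        elif token == INTERRUPT:
            # The standby instance takes over the new directive while the
            # former active instance resets, enabling near-real-time interruption.
            self.active, self.standby = self.standby, self.active
        return self.active

ctrl = DualModelController()
ctrl.handle_token("pick")            # ordinary action token: no state change
print(ctrl.handle_token(INTERRUPT))  # prints "model_B": standby becomes active
```

The point of the sketch is only the control flow: ordinary tokens pass through untouched, while the special tokens couple the model's reasoning directly to system behavior (stopping actuation, or swapping which instance drives the robot).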
Oct-28-2025