RealMirror: A Comprehensive, Open-Source Vision-Language-Action Platform for Embodied AI

Tai, Cong, Zheng, Zhaoyu, Long, Haixu, Wu, Hansheng, Xiang, Haodong, Long, Zhengbin, Xiong, Jun, Shi, Rong, Zhang, Shizhuang, Qiu, Gang, Wang, He, Li, Ruifeng, Huang, Jun, Chang, Bin, Feng, Shuai, Shen, Tao

arXiv.org Artificial Intelligence 

Abstract-- The emerging field of Vision-Language-Action (VLA) models for humanoid robots faces several fundamental challenges, including the high cost of data acquisition, the lack of a standardized benchmark, and the significant gap between simulation and the real world. To overcome these obstacles, we propose RealMirror, a comprehensive, open-source embodied AI VLA platform. RealMirror builds an efficient, low-cost data collection, model training, and inference system that enables end-to-end VLA research without requiring a real robot. To facilitate model evolution and fair comparison, we also introduce a dedicated VLA benchmark for humanoid robots, featuring multiple scenarios, extensive trajectories, and various VLA models. In conclusion, by unifying these critical components, RealMirror provides a robust framework that significantly accelerates the development of VLA models for humanoid robots.

Jun Xiong is with The Chinese University of Hong Kong, Shenzhen, China.

I. INTRODUCTION

The rapid evolution of Large Language Models (LLMs) such as GPT [1], Qwen [2], and DeepSeek [3] has significantly advanced the development of Artificial General Intelligence (AGI). While these models exhibit remarkable performance, they lack the ability to perform tasks in the real world.