Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

Haoqi Yuan, Yu Bai, Yuhui Fu, Bohan Zhou, Yicheng Feng, Xinrun Xu, Yi Zhan, Börje F. Karlsson, Zongqing Lu

Mar-16-2025, arXiv.org Artificial Intelligence

Building autonomous robotic agents capable of achieving human-level performance in real-world embodied tasks is an ultimate goal in humanoid robot research. Recent advances have made significant progress in high-level cognition with Foundation Models (FMs) and low-level skill development for humanoid robots. However, directly combining these components often results in poor robustness and efficiency due to compounding errors in long-horizon tasks and the varied latency of different modules. We introduce Being-0, a hierarchical agent framework that integrates an FM with a modular skill library. The FM handles high-level cognitive tasks such as instruction understanding, task planning, and reasoning, while the skill library provides stable locomotion and dexterous manipulation for low-level control. To bridge the gap between these levels, we propose a novel Connector module, powered by a lightweight vision-language model (VLM). The Connector enhances the FM's embodied capabilities by translating language-based plans into actionable skill commands and dynamically coordinating locomotion and manipulation to improve task success. With all components, except the FM, deployable on low-cost onboard computation devices, Being-0 achieves efficient, real-time performance on a full-sized humanoid robot equipped with dexterous hands and active vision. Extensive experiments in large indoor environments demonstrate Being-0's effectiveness in solving complex, long-horizon tasks that require challenging navigation and manipulation subtasks. For further details and videos, visit https://beingbeyond.github.io/being-0.
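The three-level hierarchy described in the abstract (FM for planning, Connector for translating plan steps into skill commands, skill library for execution) can be illustrated with a minimal sketch. Everything below is hypothetical: the function names, the `SkillCommand` type, the fixed plan, and the keyword matching are illustrative stand-ins and are not taken from the Being-0 implementation, where the Connector is a vision-language model and the skills are learned controllers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SkillCommand:
    """Hypothetical command format passed from the Connector to a skill."""
    name: str      # key into the skill library
    argument: str  # e.g. a target object or location


def foundation_model_plan(instruction: str) -> List[str]:
    # Stand-in for the FM: returns a fixed language plan for illustration.
    # A real FM would reason over the instruction and observations.
    return ["go to the table", "grasp the cup"]


def connector_translate(step: str) -> SkillCommand:
    # Stand-in for the Connector: maps one language step to a skill command.
    # Assumption: simple prefix matching replaces the VLM used in the paper.
    if step.startswith("go to "):
        return SkillCommand("navigate", step[len("go to "):])
    if step.startswith("grasp "):
        return SkillCommand("grasp", step[len("grasp "):])
    raise ValueError(f"no skill matches step: {step!r}")


# Stand-in skill library: each skill reports success (True) or failure.
SKILL_LIBRARY: Dict[str, Callable[[str], bool]] = {
    "navigate": lambda target: True,
    "grasp": lambda target: True,
}


def run_task(instruction: str) -> List[SkillCommand]:
    """Run plan steps in order and stop at the first failed skill."""
    executed: List[SkillCommand] = []
    for step in foundation_model_plan(instruction):
        command = connector_translate(step)
        succeeded = SKILL_LIBRARY[command.name](command.argument)
        if not succeeded:
            break  # a real agent would report back to the FM and replan
        executed.append(command)
    return executed


if __name__ == "__main__":
    for cmd in run_task("bring me the cup"):
        print(cmd.name, "->", cmd.argument)
```

In this sketch the planner is called once and the Connector handles each step on its own, which loosely mirrors the abstract's point that only the FM needs to run off-board while the remaining components run on onboard devices.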


Genre:
- Research Report (1.00)
- Workflow (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.68)
  - Representation & Reasoning > Agents (0.46)
  - Robots
    - Locomotion (0.67)
    - Robot Planning & Action (0.48)
  - Natural Language
    - Large Language Model (0.68)
    - Chatbot (0.46)
