VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

Zhang, Shiduo, Xu, Zhe, Liu, Peiju, Yu, Xiaopeng, Li, Yuan, Gao, Qinghui, Fei, Zhaoye, Yin, Zhangyue, Wu, Zuxuan, Jiang, Yu-Gang, Qiu, Xipeng

Dec-24-2024–arXiv.org Artificial Intelligence

General-purposed embodied agents are designed to understand the users' natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models especially Vision-Language-Action models (VLAs) have shown a substantial potential to solve language-conditioned manipulation (LCM) tasks well. However, existing benchmarks do not adequately meet the needs of VLAs and relative algorithms. To better define such general-purpose tasks in the context of LLMs and advance the research in VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed categories of tasks, with strong randomization in each category of task and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies including understanding of mesh\&texture, spatial relationship, semantic instruction, physical laws, knowledge transfer and reasoning, etc. To support the downstream finetuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that both the current state-of-the-art pretrained VLAs and the workflow based on VLMs face challenges in our tasks.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

Dec-24-2024

arXiv.org PDF

Add feedback

Country:
- Asia (0.28)

Genre:
- Research Report (1.00)
- Workflow (0.88)

Industry:
- Education (0.92)
- Leisure & Entertainment > Games (0.46)
- Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.45)
- Energy > Oil & Gas
  - Upstream (0.45)

Technology:
- Information Technology > Artificial Intelligence
  - Robots (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)