Grounding Large Language Models In Embodied Environment With Imperfect World Models

Haolan Liu, Jishen Zhao

arXiv.org Artificial Intelligence 

Despite widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose Grounding Large language models with Imperfect world MOdels (GLIMO), which uses proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM-agent-based data generator that automatically creates high-quality and diverse instruction datasets. The generator comprises an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs such as LLaMA-3, with performance boosts of 2.04, 1.54, and 1.82 on three different benchmarks, respectively. The resulting models compete with or surpass larger counterparts such as GPT-4.

Recent advances in Large Language Models (LLMs) are transforming various robotics applications, such as self-driving cars (Mao et al. (2023)), autonomous drones (Vemprala et al. (2023)), and robotic manipulation (Liang et al. (2022)). LLMs can endow robots with rich common-sense knowledge and complex planning capabilities. However, an LLM needs to be physically grounded in reality, which includes understanding the environment dynamics, the task-related constraints, and the consequences of its actions (Gao et al. (2023); Rana et al. (2023)). Many previous works in robot learning rely heavily on prompting, such as (1) decomposing problem structures using human priors (Rana et al. (2023); Liang et al. (2022)), (2) self-refinement (Zhang et al. (2023); Wang et al. (2023a)), and (3) external tools (Mao et al. (2023)). This approach does not alter the model's weights, relying instead on the LLM's pretrained knowledge. However, LLMs are trained on text corpora and lack an understanding of the fine-grained semantics of physical environments. They also suffer from hallucination (Rawte et al. (2023)) and have difficulty understanding time-aware actions (Dhingra et al. (2021)). Moreover, the "heavy prompting" approach often proves effective only in small-scale environments, such as a predefined room with fixed sets of objects and available actions (Rana et al. (2023)).
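To make the data-generation pipeline described in the abstract more concrete, below is a minimal, hypothetical Python sketch of its three stages: sampling experiences from a proxy world model with an iterative self-refinement check for temporal consistency, instantiating question-answering instruction seeds, and retrieving prior experiences as context for generation. All names here (toy_simulator, llm, QA_SEEDS, retrieve, etc.) are illustrative stand-ins chosen for this sketch, not the paper's actual implementation.

```python
# Hypothetical sketch of a GLIMO-style instruction-data generator.
# Every component below is a toy stand-in for illustration only.
import random
from dataclasses import dataclass


@dataclass
class Step:
    """One (state, action, recorded outcome) transition in a trajectory."""
    state: str
    action: str
    outcome: str


def toy_simulator(state: str, action: str) -> str:
    """Proxy world model: deterministic next-state for a (state, action) pair."""
    return f"{state}+{action}"


def llm(prompt: str) -> str:
    """Stand-in for an LLM call; replace with a real model or API."""
    return f"answer({prompt[:40]}...)"


def sample_experience(horizon: int = 5) -> list[Step]:
    """Roll out a trajectory; the sampler occasionally records a wrong
    outcome, emulating the temporal inconsistencies the refiner must catch."""
    state, steps = "init", []
    for _ in range(horizon):
        action = random.choice(["pick", "place", "move"])
        nxt = toy_simulator(state, action)
        recorded = nxt if random.random() > 0.2 else "hallucinated"
        steps.append(Step(state, action, recorded))
        state = nxt
    return steps


def is_temporally_consistent(steps: list[Step]) -> bool:
    """Each step's recorded outcome must match the next step's start state."""
    return all(a.outcome == b.state for a, b in zip(steps, steps[1:]))


def refine(max_tries: int = 3) -> list[Step]:
    """Iterative self-refinement: resample until the consistency check passes."""
    for _ in range(max_tries):
        steps = sample_experience()
        if is_temporally_consistent(steps):
            return steps
    return steps  # fall back to the last sample after max_tries


# Question-answering instruction seeds (two examples; GLIMO uses a diverse set).
QA_SEEDS = [
    "What happens if the agent executes '{action}' in state '{state}'?",
    "Which action led from '{state}' to '{outcome}'?",
]


def retrieve(memory: list[str], query: str, k: int = 2) -> list[str]:
    """Naive word-overlap retrieval over prior experiences; a real system
    would use embedding-based retrieval."""
    score = lambda doc: len(set(doc.split()) & set(query.split()))
    return sorted(memory, key=score, reverse=True)[:k]


def generate_dataset(n_trajectories: int = 4) -> list[dict]:
    """Full loop: refine trajectories, seed questions, retrieve, generate."""
    memory, dataset = [], []
    for _ in range(n_trajectories):
        for step in refine():
            question = random.choice(QA_SEEDS).format(
                state=step.state, action=step.action, outcome=step.outcome)
            context = retrieve(memory, question)  # reflect on prior experiences
            answer = llm(f"context: {context}\nquestion: {question}")
            dataset.append({"instruction": question, "response": answer})
            memory.append(question)
    return dataset


if __name__ == "__main__":
    for row in generate_dataset()[:3]:
        print(row)
```

The resulting instruction-response pairs would then serve as fine-tuning data for the base LLM; the consistency check stands in for the paper's self-refining module, which filters out temporally incoherent experiences before they reach the instruction dataset.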