virtualhome
Large Language Models as Commonsense Knowledge for Large-Scale Task Planning
Anonymous Author(s)
Appendix A: Experimental environments. We use the VirtualHome simulator [
A.1 List of objects, containers, surfaces, and rooms in the apartment. We list all the objects that are included in our experimental environment. We use object rearrangement tasks for evaluation. The tasks are randomly sampled from different distributions. Simple: this task is to move one object in the house to the desired location. Novel Simple: this task is to move one object in the house to the desired location.
HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models
Chakraborty, Trishna, Ghosh, Udita, Zhang, Xiaopan, Niloy, Fahim Faisal, Dong, Yue, Li, Jiachen, Roy-Chowdhury, Amit K., Song, Chengyu
Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents. However, inherited hallucinations, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist. In this paper, we present the first systematic study of hallucinations in LLM-based embodied agents performing long-horizon tasks under scene-task inconsistencies. Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond. To achieve these goals, we construct a hallucination probing set by building on an existing benchmark, capable of inducing hallucination rates up to 40x higher than base prompts. Evaluating 12 models across two simulation environments, we find that while models exhibit reasoning, they fail to resolve scene-task inconsistencies, highlighting fundamental limitations in handling infeasible tasks. We also provide actionable insights on ideal model behavior for each scenario, offering guidance for developing more robust and reliable planning strategies.
LERa: Replanning with Visual Feedback in Instruction Following
Pchelintsev, Svyatoslav, Patratskiy, Maxim, Onishchenko, Anatoly, Korchemnyi, Alexandr, Medvedev, Aleksandr, Vinogradova, Uliana, Galuzinsky, Ilya, Postnikov, Aleksey, Kovalev, Alexey K., Panov, Aleksandr I.
Abstract: Large Language Models are increasingly used in robotics for task planning, but their reliance on textual inputs limits their adaptability to real-world changes and failures. To address these challenges, we propose LERa (Look, Explain, Replan), a Visual Language Model-based replanning approach that utilizes visual feedback. Unlike existing methods, LERa requires only a raw RGB image, a natural language instruction, an initial task plan, and failure detection, without additional information such as object detection or predefined conditions that may be unavailable in a given scenario. The replanning process consists of three steps: (i) Look, where LERa generates a scene description and identifies errors; (ii) Explain, where it provides corrective guidance; and (iii) Replan, where it modifies the plan accordingly. LERa is adaptable to various agent architectures and can handle errors from both dynamic scene changes and task execution failures. We evaluate LERa on the newly introduced ALFRED-ChaOS and VirtualHome-ChaOS datasets, achieving a 40% improvement over baselines in dynamic environments. In tabletop manipulation tasks with a predefined probability of task failure within the PyBullet simulator, LERa improves success rates by up to 67%. Further experiments, including real-world trials with a tabletop manipulator robot, confirm LERa's effectiveness in replanning. We demonstrate that LERa is a robust and adaptable solution for error-aware task execution in robotics. The project page is available at https://lera-robo.github.io.
I. INTRODUCTION
Large Language Models (LLMs) trained on Internet-scale data can solve problems that they were not originally designed for [1].
World Model Implanting for Test-time Adaptation of Embodied Agents
Yoo, Minjong, Jang, Jinwoo, Yoon, Sihyung, Woo, Honguk
In embodied AI, a persistent challenge is enabling agents to robustly adapt to novel domains without requiring extensive data collection or retraining. To address this, we present a world model implanting framework (WorMI) that combines the reasoning capabilities of large language models (LLMs) with independently learned, domain-specific world models through test-time composition. By allowing seamless implantation and removal of the world models, the embodied agent's policy achieves and maintains cross-domain adaptability. In the WorMI framework, we employ a prototype-based world model retrieval approach, utilizing efficient trajectory-based abstract representation matching, to incorporate relevant models into test-time composition. We also develop a world-wise compound attention method that not only integrates the knowledge from the retrieved world models but also aligns their intermediate representations with the reasoning model's representation within the agent's policy. This framework design effectively fuses domain-specific knowledge from multiple world models, ensuring robust adaptation to unseen domains. We evaluate WorMI on the VirtualHome and ALFWorld benchmarks, demonstrating superior zero-shot and few-shot performance compared to several LLM-based approaches across a range of unseen domains. These results highlight the framework's potential for scalable, real-world deployment in embodied agent scenarios where adaptability and data efficiency are essential.