

Large Language Models as Commonsense Knowledge for Large-Scale Task Planning

Anonymous Author(s)

A Experimental environments: We use the VirtualHome simulator [

Neural Information Processing Systems

A.1 List of objects, containers, surfaces, and rooms in the apartment. We list all the objects included in our experimental environment. Object rearrangement tasks are used for evaluation, sampled randomly from different distributions. Simple: move one object in the house to the desired location. Novel Simple: move one object in the house to the desired location.


ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task Planning

Choi, Jae-Woo, Kim, Hyungmin, Ong, Hyobin, Jang, Minsu, Kim, Dohyung, Kim, Jaehong, Yoon, Youngwoo

arXiv.org Artificial Intelligence

Recent advancements in large language models (LLMs) have enabled significant progress in decision-making and task planning for embodied autonomous agents. However, most existing methods still struggle with complex, long-horizon tasks because they rely on a monolithic trajectory that entangles all past decisions and observations, attempting to solve the entire task in a single unified process. To address this limitation, we propose ReAcTree, a hierarchical task-planning method that decomposes a complex goal into more manageable subgoals within a dynamically constructed agent tree. Each subgoal is handled by an LLM agent node capable of reasoning, acting, and further expanding the tree, while control flow nodes coordinate the execution strategies of agent nodes. In addition, we integrate two complementary memory systems: each agent node retrieves goal-specific, subgoal-level examples from episodic memory and shares environment-specific observations through working memory. Experiments on the WAH-NL and ALFRED datasets demonstrate that ReAcTree consistently outperforms strong task-planning baselines such as ReAct across diverse LLMs. Notably, on WAH-NL, ReAcTree achieves a 61% goal success rate with Qwen 2.5 72B, nearly doubling ReAct's 31%.
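The agent-tree idea above can be sketched in a few lines. This is an illustrative toy, not the paper's actual API: the names (`AgentNode`, `toy_decompose`, `toy_execute`) and the rule for when a node expands are assumptions standing in for the LLM's decisions; the "sequence"/"fallback" control-flow semantics follow standard behavior-tree conventions.

```python
# Hedged sketch of a ReAcTree-style agent tree (illustrative names, not
# the paper's API). An agent node either executes its subgoal directly
# or expands into child subgoals; a control-flow policy coordinates the
# children: "sequence" stops at the first failure, "fallback" at the
# first success.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentNode:
    subgoal: str
    # In the paper an LLM decides whether to act or expand; here that
    # decision is stubbed by a supplied decompose function.
    decompose: Callable[[str], List[str]]
    execute: Callable[[str], bool]
    control: str = "sequence"

    def run(self) -> bool:
        children = self.decompose(self.subgoal)
        if not children:                      # leaf: act directly
            return self.execute(self.subgoal)
        results = []
        for sub in children:                  # expand the tree dynamically
            child = AgentNode(sub, self.decompose, self.execute, self.control)
            ok = child.run()
            results.append(ok)
            if self.control == "sequence" and not ok:
                return False                  # sequence fails fast
            if self.control == "fallback" and ok:
                return True                   # fallback succeeds fast
        return all(results) if self.control == "sequence" else any(results)


# Toy environment: "set the table" decomposes into two placements.
def toy_decompose(goal):
    return ["place plate", "place cup"] if goal == "set the table" else []

def toy_execute(goal):
    return goal in {"place plate", "place cup"}

root = AgentNode("set the table", toy_decompose, toy_execute)
print(root.run())  # True: both subgoals succeed under the sequence policy
```

The real system also attaches episodic memory (retrieved subgoal-level examples) and a shared working memory to each agent node; those are omitted here for brevity.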




Semantic Skill Grounding for Embodied Instruction-Following in Cross-Domain Environments

Shin, Sangwoo, Kim, Seunghyun, Jang, Youngsoo, Lee, Moontae, Woo, Honguk

arXiv.org Artificial Intelligence

In embodied instruction-following (EIF), the integration of pretrained language models (LMs) as task planners has emerged as a significant branch, where tasks are planned at the skill level by prompting LMs with pretrained skills and user instructions. However, grounding these pretrained skills in different domains remains challenging due to their intricate entanglement with domain-specific knowledge. To address this challenge, we present a semantic skill grounding (SemGro) framework that leverages the hierarchical nature of semantic skills. SemGro recognizes the broad spectrum of these skills, ranging from short-horizon, low-semantic skills that are universally applicable across domains to long-horizon, rich-semantic skills that are highly specialized and tailored to particular domains. The framework employs an iterative skill decomposition approach, starting from the higher levels of the semantic skill hierarchy and moving downwards, so as to ground each planned skill at an executable level within the target domain. To do so, we use the reasoning capabilities of LMs for composing and decomposing semantic skills, as well as their multi-modal extension for assessing skill feasibility in the target domain. Our experiments on the VirtualHome benchmark show the efficacy of SemGro in 300 cross-domain EIF scenarios.
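The iterative decomposition loop described above can be sketched as follows. This is a minimal stand-in, assuming a fixed decomposition table and a fixed set of executable skills; in SemGro itself, both the decompositions and the feasibility checks come from the LM and its multi-modal extension.

```python
# Illustrative SemGro-style iterative skill decomposition (the function
# names and tables below are assumptions, not the paper's API). A skill
# is kept if it is executable in the target domain; otherwise it is
# decomposed into lower-level semantic skills, and the check repeats
# until everything grounds to executable skills.
def ground(skill, executable, decompositions):
    """Return the executable skills that realize `skill` in the target domain."""
    frontier, plan = [skill], []
    while frontier:
        s = frontier.pop(0)
        if s in executable:           # feasible at this level: ground it
            plan.append(s)
        elif s in decompositions:     # too abstract: move down the hierarchy
            frontier = decompositions[s] + frontier
        else:
            raise ValueError(f"cannot ground skill: {s}")
    return plan


# Toy hierarchy: "make coffee" is too rich for the target domain and
# must be decomposed until only primitive skills remain.
decompositions = {
    "make coffee": ["get mug", "brew coffee"],
    "brew coffee": ["add grounds", "add water", "press button"],
}
executable = {"get mug", "add grounds", "add water", "press button"}
print(ground("make coffee", executable, decompositions))
# ['get mug', 'add grounds', 'add water', 'press button']
```

Note the top-down order: decomposition only happens when the feasibility check fails, so skills that are already executable in the target domain are never expanded further.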


MMToM-QA: Multimodal Theory of Mind Question Answering

Jin, Chuanyang, Wu, Yutong, Cao, Jing, Xiang, Jiannan, Kuo, Yen-Ling, Hu, Zhiting, Ullman, Tomer, Torralba, Antonio, Tenenbaum, Joshua B., Shu, Tianmin

arXiv.org Artificial Intelligence

Theory of Mind (ToM), the ability to understand people's minds, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data, which can include visual cues, linguistic narratives, or both. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.
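The Bayesian inverse planning at the core of BIP-ALM can be illustrated with a toy goal-inference example. This sketch is an assumption-laden simplification: BIP-ALM scores action likelihoods with a language model over unified symbolic representations, which is stubbed here with a fixed policy table; the posterior update itself is the standard Bayes rule P(goal | actions) ∝ P(goal) · ∏ P(action | goal).

```python
# Minimal Bayesian inverse planning over goals (illustrative; the policy
# table stands in for BIP-ALM's LM-scored action likelihoods).
def infer_goal(actions, goals, prior, policy):
    """Return the normalized posterior over goals given an action sequence."""
    scores = {}
    for g in goals:
        p = prior[g]
        for a in actions:
            p *= policy[g].get(a, 1e-6)  # small floor for unseen actions
        scores[g] = p
    z = sum(scores.values())
    return {g: p / z for g, p in scores.items()}


goals = ["get snack", "cook dinner"]
prior = {g: 0.5 for g in goals}
policy = {  # P(action | goal), a hypothetical stand-in for the LM scores
    "get snack":   {"open fridge": 0.6, "grab apple": 0.7, "turn on stove": 0.01},
    "cook dinner": {"open fridge": 0.5, "grab apple": 0.1, "turn on stove": 0.8},
}
posterior = infer_goal(["open fridge", "grab apple"], goals, prior, policy)
print(max(posterior, key=posterior.get))  # "get snack" is most likely
```

The same update applies regardless of whether the observed actions were extracted from video, text, or both, which is what lets a unified representation handle multimodal input.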