ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task Planning
Choi, Jae-Woo, Kim, Hyungmin, Ong, Hyobin, Jang, Minsu, Kim, Dohyung, Kim, Jaehong, Yoon, Youngwoo
Recent advancements in large language models (LLMs) have enabled significant progress in decision-making and task planning for embodied autonomous agents. However, most existing methods still struggle with complex, long-horizon tasks because they rely on a monolithic trajectory that entangles all past decisions and observations, attempting to solve the entire task in a single unified process. To address this limitation, we propose ReAcTree, a hierarchical task-planning method that decomposes a complex goal into more manageable subgoals within a dynamically constructed agent tree. Each subgoal is handled by an LLM agent node capable of reasoning, acting, and further expanding the tree, while control flow nodes coordinate the execution strategies of agent nodes. In addition, we integrate two complementary memory systems: each agent node retrieves goal-specific, subgoal-level examples from episodic memory and shares environment-specific observations through working memory. Experiments on the WAH-NL and ALFRED datasets demonstrate that ReAcTree consistently outperforms strong task-planning baselines such as ReAct across diverse LLMs. Notably, on WAH-NL, ReAcTree achieves a 61% goal success rate with Qwen 2.5 72B, nearly doubling ReAct's 31%.
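As a rough illustration of the agent-tree idea described above, the following Python sketch shows agent nodes and control flow nodes sharing a working memory. It is not the authors' implementation: the class names `AgentNode` and `ControlFlowNode`, the two control flow kinds ("sequence" and "fallback", borrowed from behavior trees), and the `policy` callable standing in for an LLM agent are all assumptions made for this example. Dynamic tree expansion and episodic memory retrieval are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

WorkingMemory = Dict[str, str]


@dataclass
class AgentNode:
    """Handles one subgoal. `policy` is a hypothetical stand-in for an LLM agent."""
    subgoal: str
    policy: Callable[[str, WorkingMemory], bool]

    def run(self, working_memory: WorkingMemory) -> bool:
        # The policy may read and write shared observations in working memory.
        return self.policy(self.subgoal, working_memory)


@dataclass
class ControlFlowNode:
    """Coordinates children. The kinds "sequence" and "fallback" are assumed here."""
    kind: str
    children: List[Union[AgentNode, ControlFlowNode]] = field(default_factory=list)

    def run(self, working_memory: WorkingMemory) -> bool:
        if self.kind == "sequence":
            # Succeeds only if every child succeeds; stops at the first failure.
            return all(child.run(working_memory) for child in self.children)
        if self.kind == "fallback":
            # Succeeds as soon as one child succeeds.
            return any(child.run(working_memory) for child in self.children)
        raise ValueError(f"unknown control flow kind: {self.kind}")


def find_apple(subgoal: str, wm: WorkingMemory) -> bool:
    wm["apple"] = "fridge"  # share an observation with later nodes
    return True


def pick_apple(subgoal: str, wm: WorkingMemory) -> bool:
    return wm.get("apple") == "fridge"  # relies on the shared observation


tree = ControlFlowNode(
    "sequence",
    [AgentNode("find the apple", find_apple), AgentNode("pick up the apple", pick_apple)],
)
memory: WorkingMemory = {}
result = tree.run(memory)
```

In this toy run the second agent node succeeds only because the first one wrote the apple's location into working memory, which mirrors the role the abstract assigns to shared, environment-specific observations.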
ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning
Liang, Yichao, Nguyen, Dat, Yang, Cambridge, Li, Tianyang, Tenenbaum, Joshua B., Rasmussen, Carl Edward, Weller, Adrian, Tavares, Zenna, Silver, Tom, Ellis, Kevin
Long-horizon embodied planning is challenging because the world does not only change through an agent's actions: exogenous processes (e.g., water heating, dominoes cascading) unfold concurrently with the agent's actions. We propose a framework for abstract world models that jointly learns (i) symbolic state representations and (ii) causal processes for both endogenous actions and exogenous mechanisms. Each causal process models the time course of a stochastic cause-effect relation. We learn these world models from limited data via variational Bayesian inference combined with LLM proposals. Across five simulated tabletop robotics environments, the learned models enable fast planning that generalizes to held-out tasks with more objects and more complex goals, outperforming a range of baselines.
Appendix A Broader Societal Impact
We introduce a new method for language-conditioned imitation learning to perform complex navigation and manipulation tasks. Our intention is for this algorithm to be used in a real-world setting where humans can provide natural language instructions to robots that can carry them out. This is the best we can do since we don't know exactly which tokens in the instruction correspond to the skills chosen. We have provided details about the levels we evaluated on below. More details can be found in the original paper.
GSL-PCD: Improving Generalist-Specialist Learning with Point Cloud Feature-based Task Partitioning
Generalization in Deep Reinforcement Learning across unseen environment variations often requires training over a diverse set of scenarios. However, random task partitioning in GSL can impede specialist performance, as it often assigns vastly different variations to the same specialist, typically resulting in each specialist being assigned just one variation, which increases computational costs. To address this, we propose Generalist-Specialist Learning with Point Cloud Feature-based Task Partitioning (GSL-PCD). This approach clusters environment variations based on features extracted from object point clouds, using balanced clustering with a greedy algorithm to assign similar variations to the same specialist. Evaluations on robotic manipulation tasks from the ManiSkill benchmark demonstrate that point cloud feature-based partitioning outperforms vanilla partitioning by 9.4% with a fixed number of specialists and reduces computational and sample requirements by 50% to achieve comparable performance.
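One plausible reading of "balanced clustering with a greedy algorithm" can be sketched as follows. This is an assumption-laden illustration, not the paper's procedure: the function name `greedy_balanced_assign`, the squared Euclidean distance, the per-specialist capacity of ceil(n / k), and the idea that cluster centers are supplied by the caller (for example from a prior k-means run) are all hypothetical choices. The feature vectors stand in for features extracted from object point clouds.

```python
import math
from typing import List, Sequence


def greedy_balanced_assign(
    features: Sequence[Sequence[float]],
    centers: Sequence[Sequence[float]],
) -> List[int]:
    """Assign each variation to a specialist (cluster) with a capacity limit.

    All (variation, cluster) pairs are visited from closest to farthest; a pair
    is accepted if the variation is still unassigned and the cluster has room.
    """
    n, k = len(features), len(centers)
    capacity = math.ceil(n / k)  # assumed balance rule: at most ceil(n / k) each

    pairs = []
    for i, f in enumerate(features):
        for c, center in enumerate(centers):
            dist = sum((a - b) ** 2 for a, b in zip(f, center))
            pairs.append((dist, i, c))
    pairs.sort()

    assignment = [-1] * n
    load = [0] * k
    for _, i, c in pairs:
        if assignment[i] == -1 and load[c] < capacity:
            assignment[i] = c
            load[c] += 1
    return assignment


# Toy features: three variations near the first center, one near the second.
features = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
centers = [(0.0, 0.0), (5.0, 5.0)]
assignment = greedy_balanced_assign(features, centers)
print(assignment)  # [0, 0, 1, 1]
```

The third variation is nearest to the first center but is pushed to the second specialist because the first is already full, which shows the trade-off a balance constraint introduces between similarity and equal load.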
Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance
Zhang, Kaifeng, Yin, Zhao-Heng, Ye, Weirui, Gao, Yang
Defining reward functions for skill learning has been a long-standing challenge in robotics. Recently, vision-language models (VLMs) have shown promise in defining reward signals for teaching robots manipulation skills. However, existing works often provide reward guidance that is too coarse, leading to inefficient learning processes. In this paper, we address this issue by implementing more fine-grained reward guidance. We decompose tasks into simpler sub-tasks, using this decomposition to offer more informative reward guidance with VLMs. We also propose a VLM-based self imitation learning process to speed up learning. Empirical evidence demonstrates that our algorithm consistently outperforms baselines such as CLIP, LIV, and RoboCLIP. Specifically, our algorithm achieves a $5.4 \times$ higher average success rate compared to the best baseline, RoboCLIP, across a series of manipulation tasks.
REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction
Liu, Zeyi, Bahety, Arpit, Song, Shuran
The ability to detect and analyze failed executions automatically is crucial for an explainable and robust robotic system. Recently, Large Language Models (LLMs) have demonstrated strong reasoning abilities on textual inputs. To leverage the power of LLMs for robot failure explanation, we introduce REFLECT, a framework which queries an LLM for failure reasoning based on a hierarchical summary of the robot's past experiences generated from multisensory observations. The failure explanation can further guide a language-based planner to correct the failure and complete the task. To systematically evaluate the framework, we create the RoboFail dataset with a variety of tasks and failure scenarios. We demonstrate that the LLM-based framework is able to generate informative failure explanations that assist successful correction planning.
Language to Rewards for Robotic Skill Synthesis
Yu, Wenhao, Gileadi, Nimrod, Fu, Chuyuan, Kirmani, Sean, Lee, Kuang-Huei, Arenas, Montse Gonzalez, Chiang, Hao-Tien Lewis, Erez, Tom, Hasenclever, Leonard, Humplik, Jan, Ichter, Brian, Xiao, Ted, Xu, Peng, Zeng, Andy, Zhang, Tingnan, Heess, Nicolas, Sadigh, Dorsa, Tan, Jie, Tassa, Yuval, Xia, Fei
Large language models (LLMs) have demonstrated exciting progress in acquiring diverse new capabilities through in-context learning, ranging from logical reasoning to code-writing. Robotics researchers have also explored using LLMs to advance the capabilities of robotic control. However, since low-level robot actions are hardware-dependent and underrepresented in LLM training corpora, existing efforts in applying LLMs to robotics have largely treated LLMs as semantic planners or relied on human-engineered control primitives to interface with the robot. On the other hand, reward functions are shown to be flexible representations that can be optimized for control policies to achieve diverse tasks, while their semantic richness makes them suitable to be specified by LLMs. In this work, we introduce a new paradigm that harnesses this realization by utilizing LLMs to define reward parameters that can be optimized to accomplish a variety of robotic tasks. Using reward as the intermediate interface generated by LLMs, we can effectively bridge the gap between high-level language instructions or corrections and low-level robot actions. Meanwhile, combining this with a real-time optimizer, MuJoCo MPC, empowers an interactive behavior creation experience where users can immediately observe the results and provide feedback to the system. To systematically evaluate the performance of our proposed method, we designed a total of 17 tasks for a simulated quadruped robot and a dexterous manipulator robot. We demonstrate that our proposed method reliably tackles 90% of the designed tasks, while a baseline using primitive skills as the interface with Code-as-policies achieves 50% of the tasks. We further validated our method on a real robot arm where complex manipulation skills such as non-prehensile pushing emerge through our interactive system.
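To make the "reward as interface" idea more concrete, here is a minimal sketch in which a dictionary of reward parameters (the kind of output an LLM might be asked to produce) defines a reward that a separate optimizer then maximizes. Everything here is hypothetical: the parameter names `torso_height` and `torso_pitch`, the quadratic penalty form, and the `best_candidate` search over a handful of candidate states are illustrative assumptions. The paper uses MuJoCo MPC as its optimizer, whose API is not reproduced here.

```python
from typing import Dict, List

# Hypothetical reward parameters, e.g. produced by an LLM from an instruction
# such as "stand up straight": each term has a target value and a weight.
RewardParams = Dict[str, Dict[str, float]]
State = Dict[str, float]


def reward(state: State, params: RewardParams) -> float:
    """Negative weighted squared distance of each state variable from its target."""
    return -sum(
        term["weight"] * (state[name] - term["target"]) ** 2
        for name, term in params.items()
    )


def best_candidate(candidates: List[State], params: RewardParams) -> State:
    """Toy stand-in for an optimizer: pick the candidate state with highest reward."""
    return max(candidates, key=lambda s: reward(s, params))


params: RewardParams = {
    "torso_height": {"target": 0.3, "weight": 1.0},
    "torso_pitch": {"target": 0.0, "weight": 0.5},
}
candidates: List[State] = [
    {"torso_height": 0.1, "torso_pitch": 0.0},
    {"torso_height": 0.3, "torso_pitch": 0.2},
    {"torso_height": 0.3, "torso_pitch": 0.0},
]
chosen = best_candidate(candidates, params)
```

The point of the split is that the language model only has to choose targets and weights, while the low-level search for actions is left to the optimizer; a user correction would then amount to editing `params` and re-running the search.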
Sim2Real$^2$: Actively Building Explicit Physics Model for Precise Articulated Object Manipulation
Ma, Liqian, Meng, Jiaojiao, Liu, Shuntao, Chen, Weihang, Xu, Jing, Chen, Rui
Accurately manipulating articulated objects is a challenging yet important task for real robot applications. In this paper, we present a novel framework called Sim2Real$^2$ to enable the robot to manipulate an unseen articulated object to the desired state precisely in the real world with no human demonstrations. We leverage recent advances in physics simulation and learning-based perception to build the interactive explicit physics model of the object and use it to plan a long-horizon manipulation trajectory to accomplish the task. However, the interactive model cannot be correctly estimated from a static observation. Therefore, we learn to predict the object affordance from a single-frame point cloud, control the robot to actively interact with the object with a one-step action, and capture another point cloud. The physics model is then constructed from the two point clouds. Experimental results show that our framework completes about 70% of manipulations with <30% relative error for common articulated objects, and 30% of manipulations for difficult objects. Our proposed framework also enables advanced manipulation strategies, such as manipulating with different tools. Code and videos are available on our project webpage: https://ttimelord.github.io/Sim2Real2-site/