saycan
ConceptBot: Enhancing Robot's Autonomy through Task Decomposition with Large Language Models and Knowledge Graph
Leanza, Alessandro, Moroncelli, Angelo, Vizzari, Giuseppe, Braghin, Francesco, Roveda, Loris, Spahiu, Blerina
ConceptBot is a modular robotic planning framework that combines Large Language Models and Knowledge Graphs to generate feasible and risk-aware plans despite ambiguities in natural language instructions and the difficulty of correctly analyzing the objects present in the environment, challenges that typically arise from a lack of commonsense reasoning. To do that, ConceptBot integrates (i) an Object Property Extraction (OPE) module that enriches scene understanding with semantic concepts from ConceptNet, (ii) a User Request Processing (URP) module that disambiguates and structures instructions, and (iii) a Planner that generates context-aware, feasible pick-and-place policies. In comparative evaluations against Google SayCan, ConceptBot achieved 100% success on explicit tasks, maintained 87% accuracy on implicit tasks (versus 31% for SayCan), reached 76% on risk-aware tasks (versus 15%), and outperformed SayCan in application-specific scenarios, including material classification (70% vs. 20%) and toxicity detection (86% vs. 36%). On SafeAgentBench, ConceptBot achieved an overall score of 80% (versus 46% for the next-best baseline). These results, validated in both simulation and laboratory experiments, demonstrate ConceptBot's ability to generalize without domain-specific training and to significantly improve the reliability of robotic policies in unstructured environments.
Advances in recent decades in core robotic capabilities, i.e., perception, control, and manipulation, have increased demand for autonomous systems in fields ranging from manufacturing and healthcare to logistics and home care. These capabilities are deeply interconnected with the planning phase [1], as successful planning depends on a robot's ability to perceive its environment accurately, execute precise control, and perform effective manipulation. Despite significant progress, planning in robotic systems continues to face challenges, particularly in unstructured environments [2].
A key element in achieving effective planning is task decomposition [3], which involves breaking complex objectives into smaller, manageable actions. This process is essential for simplifying execution and ensuring flexibility in diverse environments. Traditional task decomposition approaches, however, often rely on rigid, pre-programmed templates or static models, which struggle to adapt to unfamiliar or dynamic conditions [4]-[7]. Recently, advancements in Large Language Models (LLMs) have introduced a more dynamic alternative. LLMs enable robots to process natural language instructions, understand contextual nuances, and dynamically decompose tasks into actionable steps [8]-[10]. However, directly employing pre-trained LLMs often leads to non-executable or ineffective plans, as these models struggle to account for domain-specific constraints and real-world feasibility [11]-[13].
CAPE: Corrective Actions from Precondition Errors using Large Language Models
Raman, Shreyas Sundara, Cohen, Vanya, Paulius, David, Idrees, Ifrah, Rosen, Eric, Mooney, Ray, Tellex, Stefanie
Extracting commonsense knowledge from a large language model (LLM) offers a path to designing intelligent robots. Existing approaches that leverage LLMs for planning are unable to recover when an action fails and often resort to retrying failed actions without resolving the error's underlying cause. We propose a novel approach (CAPE) that proposes corrective actions to resolve precondition errors during planning. CAPE improves the quality of generated plans by leveraging few-shot reasoning from action preconditions. Our approach enables embodied agents to execute more tasks than baseline methods while ensuring semantic correctness and minimizing re-prompting. In VirtualHome, CAPE generates executable plans while improving a human-annotated plan correctness metric from 28.89% to 49.63% over SayCan. Our improvements transfer to a Boston Dynamics Spot robot initialized with a set of skills (specified in language) and associated preconditions, where CAPE improves the correctness metric of the executed task plans by 76.49% compared to SayCan. Our approach enables the robot to follow natural language commands and robustly recover from failures, which baseline approaches largely cannot resolve or can address only inefficiently.
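The idea of resolving a precondition error with a corrective action, rather than retrying the failed action, can be illustrated with a small sketch. This is not CAPE's implementation: the skill table, the symbolic facts, and the `propose_correction` function (a lookup standing in for an LLM re-prompt) are all hypothetical names chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical skills: name -> (preconditions, effects) over symbolic facts.
Skill = Tuple[Set[str], Set[str]]

SKILLS: Dict[str, Skill] = {
    "open fridge": (set(), {"fridge_open"}),
    "grab milk": ({"fridge_open"}, {"holding_milk"}),
    "pour milk": ({"holding_milk"}, {"milk_poured"}),
}


def propose_correction(missing: Set[str]) -> str:
    """Stand-in for an LLM re-prompt (assumption): return a skill whose
    effects establish at least one of the missing preconditions."""
    for name, (_, effects) in SKILLS.items():
        if effects & missing:
            return name
    raise LookupError(f"no skill establishes {sorted(missing)}")


def execute_with_corrections(plan: List[str], max_corrections: int = 5) -> List[str]:
    """Run the plan step by step; when a precondition is unmet, insert a
    corrective action before the failing step instead of retrying it."""
    state: Set[str] = set()
    executed: List[str] = []
    queue = list(plan)
    corrections = 0
    while queue:
        step = queue[0]
        preconditions, effects = SKILLS[step]
        missing = preconditions - state
        if missing:
            if corrections >= max_corrections:
                raise RuntimeError(f"gave up on step {step!r}")
            queue.insert(0, propose_correction(missing))
            corrections += 1
            continue
        state |= effects
        executed.append(queue.pop(0))
    return executed


# The initial plan skips opening the fridge; the loop repairs it.
repaired = execute_with_corrections(["grab milk", "pour milk"])
print(repaired)
```

The `max_corrections` bound is a design choice of this sketch so that an unresolvable error terminates instead of looping; it mirrors the abstract's goal of limiting re-prompting only loosely.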
A Picture is Worth a Thousand Words: Language Models Plan from Pixels
Liu, Anthony Z., Logeswaran, Lajanugen, Sohn, Sungryull, Lee, Honglak
Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments. In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments. Prior PLM-based approaches for planning either assume observations are available in the form of text (e.g., provided by a captioning model), reason about plans from the instruction alone, or incorporate information about the visual environment in limited ways (such as a pre-trained affordance function). In contrast, we show that PLMs can accurately plan even when observations are directly encoded as input prompts for the PLM. We show that this simple approach outperforms prior approaches in experiments on the ALFWorld and VirtualHome benchmarks.
RT-1: Robotics Transformer for Real-World Control at Scale – Google AI Blog
Major recent advances in multiple subfields of machine learning (ML) research, such as computer vision and natural language processing, have been enabled by a shared common approach that leverages large, diverse datasets and expressive models that can absorb all of the data effectively. Although there have been various attempts to apply this approach to robotics, robots have not yet leveraged highly capable models as well as other subfields. Several factors contribute to this challenge. First, there's the lack of large-scale and diverse robotic data, which limits a model's ability to absorb a broad set of robotic experiences. Data collection is particularly expensive and challenging for robotics because dataset curation requires engineering-heavy autonomous operation, or demonstrations collected using human teleoperations. To address these challenges, we propose the Robotics Transformer 1 (RT-1), a multi-task model that tokenizes robot inputs and output actions (e.g., camera images, task instructions, and motor commands) to enable efficient inference at runtime, which makes real-time control feasible.
Open-vocabulary Queryable Scene Representations for Real World Planning
Chen, Boyuan, Xia, Fei, Ichter, Brian, Rao, Kanishka, Gopalakrishnan, Keerthana, Ryoo, Michael S., Stone, Austin, Kappler, Daniel
Large language models (LLMs) have unlocked new capabilities of task planning from human instructions. NLMap first establishes a natural language queryable scene representation with visual language models (VLMs). An LLM-based object proposal module parses instructions and proposes involved objects to query the scene representation for object availability and location. An LLM planner then plans with such information about the scene. We propose an open-vocabulary and queryable scene representation for real-world planning. The returned object presence and location are used for LLM-based planning.
[...] upon it. Recent progress in large language models (LLMs) has shown impressive few-shot performance in language comprehension, semantic understanding, and reasoning, as well as application to robotics problems like planning [5]-[7] and instruction following [8]. Using such models in embodied settings can provide significant challenges, most critically because [...]
It has to first identify relevant objects and locations within the scene (e.g., the watering can, the sink, and each potential plant) and then plan over these objects in sequential order (get the watering can, then go to the sink, and then fill it up), conditioning on its affordances (e.g., can it carry a full watering can), and conditioning on the scene (e.g., how many plants there are, and where they are).
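To make the notion of a queryable scene representation concrete, here is a minimal sketch of querying a scene for object presence and location. It is not NLMap's implementation: the three-dimensional vectors are hand-made stand-ins for VLM embeddings, and `SCENE`, `TEXT_EMBEDDINGS`, `query_scene`, and the 0.8 threshold are assumptions made for illustration.

```python
import math
from typing import List, Optional, Tuple

Vec = Tuple[float, ...]
Location = Tuple[float, float]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical scene map: (region embedding, 2D location) pairs.
SCENE: List[Tuple[Vec, Location]] = [
    ((0.9, 0.1, 0.0), (1.0, 2.0)),  # region resembling a watering can
    ((0.0, 0.2, 0.9), (4.0, 0.5)),  # region resembling a sink
]

# Hypothetical text embeddings for object names proposed from the instruction.
TEXT_EMBEDDINGS = {
    "watering can": (1.0, 0.0, 0.0),
    "sink": (0.0, 0.0, 1.0),
    "plant": (0.0, 1.0, 0.0),
}


def query_scene(name: str, threshold: float = 0.8) -> Optional[Location]:
    """Return the location of the best-matching region, or None when no
    region is similar enough (the object is treated as absent)."""
    query = TEXT_EMBEDDINGS[name]
    embedding, location = max(SCENE, key=lambda region: cosine(query, region[0]))
    return location if cosine(query, embedding) >= threshold else None


# Presence and location per proposed object, as would be handed to a planner.
availability = {name: query_scene(name) for name in TEXT_EMBEDDINGS}
print(availability)
```

Returning `None` for an object that matches no region is what lets a downstream planner distinguish "not present" from "present at a location", which is the information the abstract says is passed to LLM-based planning.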
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Ahn, Michael, Brohan, Anthony, Brown, Noah, Chebotar, Yevgen, Cortes, Omar, David, Byron, Finn, Chelsea, Fu, Chuyuan, Gopalakrishnan, Keerthana, Hausman, Karol, Herzog, Alex, Ho, Daniel, Hsu, Jasmine, Ibarz, Julian, Ichter, Brian, Irpan, Alex, Jang, Eric, Ruano, Rosario Jauregui, Jeffrey, Kyle, Jesmonth, Sally, Joshi, Nikhil J, Julian, Ryan, Kalashnikov, Dmitry, Kuang, Yuheng, Lee, Kuang-Huei, Levine, Sergey, Lu, Yao, Luu, Linda, Parada, Carolina, Pastor, Peter, Quiambao, Jornell, Rao, Kanishka, Rettinghouse, Jarek, Reyes, Diego, Sermanet, Pierre, Sievers, Nicolas, Tan, Clayton, Toshev, Alexander, Vanhoucke, Vincent, Xia, Fei, Xiao, Ted, Xu, Peng, Xu, Sichun, Yan, Mengyuan, Zeng, Andy
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website, the video, and open sourced code in a tabletop domain can be found at say-can.github.io.
Figure 1: LLMs have not interacted with their environment and observed the outcome of their responses, and thus are not grounded in the world. SayCan grounds LLMs via value functions of pretrained skills, allowing them to execute real-world, abstract, long-horizon commands on robots.
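The combination described above, where a language model judges which skill is useful and a value function judges which skill is feasible, can be sketched as a product of two scores. This is a minimal illustration rather than the released SayCan code: `select_skill`, the skill names, and all numeric scores are made up for the example.

```python
from typing import Dict


def select_skill(llm_scores: Dict[str, float], affordances: Dict[str, float]) -> str:
    """Pick the skill with the highest (LLM score x affordance value).

    llm_scores: how relevant the language model finds each skill for the
        instruction (hypothetical probabilities).
    affordances: how likely each skill is to succeed from the current state,
        standing in for a learned value function (hypothetical values).
    Skills without an affordance estimate are treated as infeasible.
    """
    combined = {
        skill: score * affordances.get(skill, 0.0)
        for skill, score in llm_scores.items()
    }
    return max(combined, key=combined.get)


# Instruction: "clean up the spill". The LLM alone slightly prefers a vacuum,
# but no vacuum is reachable, so grounding shifts the choice to the sponge.
llm_scores = {"find a vacuum": 0.5, "find a sponge": 0.4, "go to the table": 0.1}
affordances = {"find a vacuum": 0.1, "find a sponge": 0.8, "go to the table": 0.9}

print(select_skill(llm_scores, affordances))
```

In this toy setting the language-model scores alone would select "find a vacuum", while the combined score selects "find a sponge", which illustrates the role the abstract assigns to value functions as grounding.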
Deep Science: Vision plus language could yield capable AI – TechCrunch
Depending on the theory of intelligence to which you subscribe, achieving "human-level" AI will require a system that can leverage multiple modalities (e.g., sound, vision and text) to reason about the world. For example, when shown an image of a toppled truck and a police cruiser on a snowy freeway, a human-level AI might infer that dangerous road conditions caused an accident. Or, running on a robot, when asked to grab a can of soda from the refrigerator, it would navigate around people, furniture and pets to retrieve the can and place it within reach of the requester. But new research shows signs of encouraging progress, from robots that can figure out steps to satisfy basic commands (e.g., "get a water bottle") to text-producing systems that learn from explanations. In this revived edition of Deep Science, our weekly series about the latest developments in AI and the broader scientific field, we're covering work out of DeepMind, Google and OpenAI that makes strides toward systems that can, if not perfectly understand the world, solve narrow tasks like generating images with impressive robustness.