Non-Collaborative User Simulators for Tool Agents

arXiv.org Artificial Intelligence

Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, which fails to train and test agents against non-collaborative users in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and $\tau$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents' weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.


ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task Planning

arXiv.org Artificial Intelligence

Recent advancements in large language models (LLMs) have enabled significant progress in decision-making and task planning for embodied autonomous agents. However, most existing methods still struggle with complex, long-horizon tasks because they rely on a monolithic trajectory that entangles all past decisions and observations, attempting to solve the entire task in a single unified process. To address this limitation, we propose ReAcTree, a hierarchical task-planning method that decomposes a complex goal into more manageable subgoals within a dynamically constructed agent tree. Each subgoal is handled by an LLM agent node capable of reasoning, acting, and further expanding the tree, while control flow nodes coordinate the execution strategies of agent nodes. In addition, we integrate two complementary memory systems: each agent node retrieves goal-specific, subgoal-level examples from episodic memory and shares environment-specific observations through working memory. Experiments on the WAH-NL and ALFRED datasets demonstrate that ReAcTree consistently outperforms strong task-planning baselines such as ReAct across diverse LLMs. Notably, on WAH-NL, ReAcTree achieves a 61% goal success rate with Qwen 2.5 72B, nearly doubling ReAct's 31%.
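The tree structure described above can be illustrated with a minimal sketch. This is not the ReAcTree implementation: the class names, the `policy` callable standing in for an LLM agent, and the two control-flow kinds ("sequence" and "fallback", borrowed from behavior trees) are all assumptions made for illustration, and the memory systems are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class AgentNode:
    """Leaf node that attempts one subgoal.

    `policy` is a hypothetical stand-in for an LLM agent: it takes the
    subgoal text and returns True on success, False on failure.
    """
    subgoal: str
    policy: Callable[[str], bool]

    def run(self, log: List[str]) -> bool:
        log.append(self.subgoal)  # record which subgoals were attempted
        return self.policy(self.subgoal)


@dataclass
class ControlFlowNode:
    """Inner node that decides how its children are executed.

    Assumed kinds (behavior-tree style, not taken from the abstract):
      "sequence": run children in order, stop at the first failure.
      "fallback": run children in order, stop at the first success.
    """
    kind: str
    children: List[Union["AgentNode", "ControlFlowNode"]] = field(default_factory=list)

    def run(self, log: List[str]) -> bool:
        if self.kind == "sequence":
            # all() short-circuits, so later children are skipped after a failure
            return all(child.run(log) for child in self.children)
        if self.kind == "fallback":
            # any() short-circuits, so later children are skipped after a success
            return any(child.run(log) for child in self.children)
        raise ValueError(f"unknown control flow kind: {self.kind}")


def toy_policy(subgoal: str) -> bool:
    """Toy policy for the demo: any subgoal mentioning the fridge fails."""
    return "fridge" not in subgoal


tree = ControlFlowNode("sequence", [
    AgentNode("find apple", toy_policy),
    ControlFlowNode("fallback", [
        AgentNode("put apple in fridge", toy_policy),
        AgentNode("put apple on table", toy_policy),
    ]),
])

attempted: List[str] = []
succeeded = tree.run(attempted)
```

In this toy run the fallback node recovers from the failed fridge subgoal by trying the table subgoal, so the overall goal succeeds. Each subgoal is handled in isolation, which is the contrast with a single monolithic trajectory that the abstract draws.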


VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation

arXiv.org Artificial Intelligence

Current vision-language-action (VLA) models, pre-trained on large-scale robotic data, exhibit strong multi-task capabilities and generalize well to variations in visual and language instructions for manipulation. However, their success rate drops significantly when faced with object concepts outside the training data, such as unseen object descriptions and textures in the dataset. Based on the LIBERO simulation environment, we introduced novel objects and object descriptions to construct a new evaluation benchmark with three difficulty levels to test the effectiveness of our method. Our framework successfully outperformed the current state-of-the-art models on our designed hard-level generalization benchmark.

I. INTRODUCTION

In recent years, foundation models have profoundly influenced the development of artificial intelligence research. In the field of robotics, Vision-Language-Action (VLA) models [10]-[16] built upon vision-language models represent a prominent research paradigm. This approach effectively harnesses the learning capacity of large-scale models and shows strong potential to serve as a foundational backbone for general-purpose robots performing manipulation tasks in open-world environments in the future. In evaluation involving unseen concepts (i.e., object textures and language descriptions outside the dataset), our proposed framework surpasses other state-of-the-art models finetuned on the original LIBERO dataset. In contrast, the reproduced Agentic Robot framework [17] using our model exhibits a noticeable performance degradation in this task. Some researchers have attempted to jointly train robotic manipulation data with web-scale multimodal data [10], [14], aiming to preserve extensive conceptual knowledge during training and thereby enhance generalization in manipulation tasks.


Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning

arXiv.org Artificial Intelligence

Collaboration is ubiquitous and essential in day-to-day life -- from exchanging ideas, to delegating tasks, to generating plans together. This work studies how LLMs can adaptively collaborate to perform complex embodied reasoning tasks. To this end we introduce MINDcraft, an easily extensible platform built to enable LLM agents to control characters in the open-world game of Minecraft; and MineCollab, a benchmark to test the different dimensions of embodied and collaborative reasoning. An experimental study finds that the primary bottleneck in collaborating effectively for current state-of-the-art agents is efficient natural language communication, with agent performance dropping as much as 15% when they are required to communicate detailed task completion plans. We conclude that existing LLM agents are ill-optimized for multi-agent collaboration, especially in embodied scenarios, and highlight the need to employ methods beyond in-context and imitation learning. Our website can be found here: https://mindcraft-minecollab.github.io/


Kallini et al. (2024) do not compare impossible languages with constituency-based ones

arXiv.org Artificial Intelligence

A central goal of linguistic theory is to find a precise characterization of the notion "possible human language", in the form of a computational device that is capable of describing all and only the languages that can be acquired by a typically developing human child. The success of recent large language models (LLMs) in NLP applications arguably raises the possibility that LLMs might be computational devices that meet this goal. This would only be the case if, in addition to succeeding in learning human languages, LLMs struggle to learn "impossible" human languages. Kallini et al. (2024; "Mission: Impossible Language Models", Proc. ACL) conducted experiments aiming to test this by training GPT-2 on a variety of synthetic languages, and found that it learns some more successfully than others. They present these asymmetries as support for the idea that LLMs' inductive biases align with what is regarded as "possible" for human languages, but the most significant comparison has a confound that makes this conclusion unwarranted. In this paper I explain the confound and suggest some ways forward towards constructing a comparison that appropriately tests the underlying issue.


Reflex-Based Open-Vocabulary Navigation without Prior Knowledge Using Omnidirectional Camera and Multiple Vision-Language Models

arXiv.org Artificial Intelligence

Various robot navigation methods have been developed, but they are mainly based on Simultaneous Localization and Mapping (SLAM), reinforcement learning, etc., which require prior map construction or learning. In this study, we consider the simplest method that does not require any map construction or learning, and perform open-vocabulary robot navigation without any prior knowledge. We applied an omnidirectional camera and pre-trained vision-language models to the robot. The omnidirectional camera provides a uniform view of the surroundings, thus eliminating the need for complicated exploratory behaviors including trajectory generation. By applying multiple pre-trained vision-language models to this omnidirectional image and incorporating reflective behaviors, we show that navigation becomes simple and does not require any prior setup. Interesting properties and limitations of our method are discussed based on experiments with the mobile robot Fetch.


Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework

arXiv.org Artificial Intelligence

The socially-aware navigation system has evolved to adeptly avoid various obstacles while performing multiple tasks, such as point-to-point navigation, human-following, and -guiding. However, a prominent gap persists: in Human-Robot Interaction (HRI), the procedure of communicating commands to robots demands intricate mathematical formulations. Furthermore, the transition between tasks does not quite possess the intuitive control and user-centric interactivity that one would desire. In this work, we propose an LLM-driven interactive multimodal multitask robot navigation framework, termed LIM2N, to solve the above new challenge in the navigation field. We achieve this by first introducing a multimodal interaction framework where language and hand-drawn inputs can serve as navigation constraints and control objectives. Next, a reinforcement learning agent is built to handle multiple tasks with the received information. Crucially, LIM2N creates smooth cooperation among the reasoning of multimodal input, multitask planning, and adaptation and processing of the intelligent sensing modules in the complicated system. Extensive experiments are conducted in both simulation and the real world demonstrating that LIM2N has superior user needs understanding, alongside an enhanced interactive experience.


Plug in the Safety Chip: Enforcing Constraints for LLM-driven Robot Agents

arXiv.org Artificial Intelligence

Recent advancements in large language models (LLMs) have enabled a new research domain, LLM agents, for solving robotics and planning tasks by leveraging the world knowledge and general reasoning abilities of LLMs obtained during pretraining. However, while considerable effort has been made to teach the robot the "dos," the "don'ts" received relatively less attention. We argue that, for any practical usage, it is as crucial to teach the robot the "don'ts": conveying explicit instructions about prohibited actions, assessing the robot's comprehension of these restrictions, and, most importantly, ensuring compliance. Moreover, verifiable safe operation is essential for deployments that satisfy worldwide standards such as ISO 61508, which defines standards for safely deploying robots in industrial factory environments worldwide. Aiming at deploying the LLM agents in a collaborative environment, we propose a queryable safety constraint module based on linear temporal logic (LTL) that simultaneously enables natural language (NL) to temporal constraints encoding, safety violation reasoning and explaining, and unsafe action pruning. To demonstrate the effectiveness of our system, we conducted experiments in VirtualHome environment and on a real robot. The experimental results show that our system strictly adheres to the safety constraints and scales well with complex safety constraints, highlighting its potential for practical utility.
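The idea of unsafe action pruning can be illustrated with a small sketch. It is a deliberately simplified stand-in, not the paper's module: instead of a general LTL engine it hard-codes two constraint patterns ("never do X" and "do not do X until Y has happened"), and the names `Constraint`, `allows`, and `prune_unsafe` as well as the action strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Constraint:
    """One safety rule over action names (illustrative, not a full LTL formula).

    until=None      : `forbidden` is never allowed.
    until="action"  : `forbidden` is blocked until that action appears
                      in the executed history.
    """
    forbidden: str
    until: Optional[str] = None

    def allows(self, history: Sequence[str], action: str) -> bool:
        if action != self.forbidden:
            return True  # this rule says nothing about other actions
        # The forbidden action is released only if a releasing action
        # exists and has already been executed.
        return self.until is not None and self.until in history


def prune_unsafe(
    history: Sequence[str],
    candidates: Sequence[str],
    constraints: Sequence[Constraint],
) -> List[str]:
    """Keep only candidate actions that every constraint allows."""
    return [
        action
        for action in candidates
        if all(rule.allows(history, action) for rule in constraints)
    ]


rules = [
    Constraint("turn_on_stove", until="place_pot"),  # no stove before a pot is placed
    Constraint("enter_bathroom"),                    # never enter the bathroom
]
proposed = ["turn_on_stove", "place_pot", "enter_bathroom"]

safe_at_start = prune_unsafe([], proposed, rules)
safe_after_pot = prune_unsafe(["place_pot"], proposed, rules)
```

A filter of this kind sits between the LLM planner and the executor: the planner proposes candidate actions, and only those consistent with the constraints given the history so far are passed on. A real LTL-based monitor would track automaton state rather than scan the history, and would support arbitrary temporal formulas.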


Task Generalization with Stability Guarantees via Elastic Dynamical System Motion Policies

arXiv.org Artificial Intelligence

With advanced development in robotics and autonomous systems in the past decades, the opportunities and demands for more complex physical human-robot interaction (pHRI) in our everyday unconstrained environments are rising; thus, it is critical for robots to be adaptive, compliant, reactive, safe and easy to program [1, 2, 3]. In many cases, robots will need to acquire new skills to satisfy task requirements in an ever-changing environment. It is usually difficult for non-experts to program robots for complex motion tasks and even tedious for experts to reprogram them when task requirements change. A straightforward and intuitive approach for robots to develop new skills is through Learning from Demonstration (LfD) [4, 5, 6, 7, 8]. This paradigm allows robots to acquire skills, typically encoded or defined in literature as action policies, motion policies, or imitation policies, directly from motion examples provided by humans or even other robots, mirroring a teacher-student relationship. In recent years, significant progress has been made in using LfD to learn complex and diverse motion tasks.


The Planner Optimization Problem: Formulations and Frameworks

arXiv.org Artificial Intelligence

Identifying internal parameters for planning is crucial to maximizing the performance of a planner. However, automatically tuning internal parameters which are conditioned on the problem instance is especially challenging. A recent line of work focuses on learning planning parameter generators, but lacks a consistent problem definition and software framework. This work proposes the unified planner optimization problem (POP) formulation, along with the Open Planner Optimization Framework (OPOF), a highly extensible software framework to specify and to solve these problems in a reusable manner.