task step
TaskBench: Benchmarking Large Language Models for Task Automation
To address this, we introduce TASKBENCH, a comprehensive framework to evaluate the capability of LLMs in task automation. Specifically, task automation can be divided into three critical stages: task decomposition, tool selection, and parameter prediction. To tackle the complexities inherent in these stages, we introduce the concept of Tool Graph to represent decomposed tasks and adopt a back-instruct method to generate high-quality user instructions. We propose TASKEVAL, a multi-faceted evaluation methodology that assesses LLM performance across these three stages.
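As a rough illustration of how a decomposed task might be represented as a graph of tools and then scored, the sketch below stores a task as tool nodes, dependency edges, and parameter triples, and compares a prediction against a reference with set-overlap F1. The data layout, the tool names, and the choice of F1 are assumptions made for this example; the abstract does not specify how TASKEVAL computes its scores, and the task decomposition stage (free text) is not covered here.

```python
def f1(predicted: set, gold: set) -> float:
    """Set-overlap F1 between predicted items and reference items."""
    if not predicted and not gold:
        return 1.0  # nothing expected and nothing predicted
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_prediction(pred: dict, gold: dict) -> dict:
    """Score tool selection (nodes, edges) and parameter prediction.

    Hypothetical layout: "tools" is a list of tool names, "edges" is a list
    of (upstream, downstream) pairs, and "params" is a list of
    (tool, parameter_name, value) triples.
    """
    return {
        "tool_f1": f1(set(pred["tools"]), set(gold["tools"])),
        "edge_f1": f1(set(pred["edges"]), set(gold["edges"])),
        "param_f1": f1(set(pred["params"]), set(gold["params"])),
    }


# Reference graph for an invented instruction:
# "Transcribe clip.wav, translate it to French, and read it aloud."
gold = {
    "tools": ["speech_to_text", "translate", "text_to_speech"],
    "edges": [("speech_to_text", "translate"), ("translate", "text_to_speech")],
    "params": [
        ("speech_to_text", "audio", "clip.wav"),
        ("translate", "target_lang", "fr"),
    ],
}

# A prediction that misses the last tool and picks the wrong language.
pred = {
    "tools": ["speech_to_text", "translate"],
    "edges": [("speech_to_text", "translate")],
    "params": [
        ("speech_to_text", "audio", "clip.wav"),
        ("translate", "target_lang", "de"),
    ],
}

scores = score_prediction(pred, gold)
```

Scoring nodes, edges, and parameters separately keeps the stages distinguishable: a model can pick the right tools yet wire them in the wrong order, or wire them correctly yet fill in the wrong arguments.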
Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence
Sharrock, Callum, Petersson, Lukas, Petersson, Hanna, Backlund, Axel, Wennström, Axel, Nordström, Kristoffer, Aronsson, Elias
We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. LLMs struggled the most with multi-step spatial planning and social understanding. We also evaluate LLMs that are fine-tuned for embodied reasoning and conclude that this training does not improve their score on Butter-Bench. Language models (LMs) were initially intended for narrow text understanding tasks. The first Transformer-based LM (Vaswani et al., 2017) was explicitly trained for translation. However, large-scale training runs of LMs eventually resulted in emergent behaviour: model capabilities that were not explicitly trained for (Brown et al., 2020). For example, LLMs are not trained to be robots, yet companies such as Figure (Helix, 2025) and Google DeepMind (Gemini Robotics 1.5, 2025) use LLMs in their robotic stack.
LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration -- A Robot Sous-Chef Application
Huang, Zhe, Pohovey, John, Yammanuru, Ananya, Driggs-Campbell, Katherine
Large Language Models (LLM) and Vision Language Models (VLM) enable robots to ground natural language prompts into control actions to achieve tasks in an open world. However, when applied to a long-horizon collaborative task, this formulation results in excessive prompting for initiating or clarifying robot actions at every step of the task. We propose Language-driven Intention Tracking (LIT), leveraging LLMs and VLMs to model the human user's long-term behavior and to predict the next human intention to guide the robot for proactive collaboration. We demonstrate smooth coordination between a LIT-based collaborative robot and the human user in collaborative cooking tasks.
Agile and versatile bipedal robot tracking control through reinforcement learning
Li, Jiayi, Ye, Linqi, Cheng, Yi, Liu, Houde, Liang, Bin
The remarkable athletic intelligence displayed by humans in complex dynamic movements such as dancing and gymnastics suggests that the balance mechanism in biological beings is decoupled from specific movement patterns. This decoupling allows for the execution of both learned and unlearned movements under certain constraints while maintaining balance through minor whole-body coordination. To replicate this balance ability and body agility, this paper proposes a versatile controller for bipedal robots. This controller achieves ankle and body trajectory tracking across a wide range of gaits using a single small-scale neural network, which is based on a model-based IK solver and reinforcement learning. We consider a single step as the smallest control unit and design a universally applicable control input form suitable for any single-step variation. Highly flexible gait control can be achieved by combining these minimal control units with high-level policy through our extensible control interface. To enhance the trajectory-tracking capability of our controller, we utilize a three-stage training curriculum. After training, the robot can move freely between target footholds at varying distances and heights. The robot can also maintain static balance without repeated stepping to adjust posture. Finally, we evaluate the tracking accuracy of our controller on various bipedal tasks, and the effectiveness of our control framework is verified in the simulation environment.
Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance
Zhang, Jesse, Zhang, Jiahui, Pertsch, Karl, Liu, Ziyi, Ren, Xiang, Chang, Minsuk, Sun, Shao-Hua, Lim, Joseph J.
We propose BOSS, an approach that automatically learns to solve new long-horizon, complex, and meaningful tasks by growing a learned skill library with minimal supervision. Prior work in reinforcement learning requires expert supervision, in the form of demonstrations or rich reward functions, to learn long-horizon tasks. Instead, our approach BOSS (BOotStrapping your own Skills) learns to accomplish new tasks by performing "skill bootstrapping," where an agent with a set of primitive skills interacts with the environment to practice new skills without receiving reward feedback for tasks outside of the initial skill set. This bootstrapping phase is guided by large language models (LLMs) that inform the agent of meaningful skills to chain together. Through this process, BOSS builds a wide range of complex and useful behaviors from a basic set of primitive skills. We demonstrate through experiments in realistic household environments that agents trained with our LLM-guided bootstrapping procedure outperform those trained with naive bootstrapping as well as prior unsupervised skill acquisition methods on zero-shot execution of unseen, long-horizon tasks in new environments. Website at clvrai.com/boss.
DiLogics: Creating Web Automation Programs With Diverse Logics
Pu, Kevin, Yang, Jim, Yuan, Angel, Ma, Minyi, Dong, Rui, Wang, Xinyu, Chen, Yan, Grossman, Tovi
Knowledge workers frequently encounter repetitive web data entry tasks, like updating records or placing orders. Web automation increases productivity, but translating tasks to web actions accurately and extending to new specifications is challenging. Existing tools can automate tasks that perform the same logical trace of UI actions (e.g., input text in each field in order), but do not support tasks requiring different executions based on varied input conditions. We present DiLogics, a programming-by-demonstration system that utilizes NLP to assist users in creating web automation programs that handle diverse specifications. DiLogics first semantically segments input data to structured task steps. By recording user demonstrations for each step, DiLogics generalizes the web macros to novel but semantically similar task requirements. Our evaluation showed that non-experts can effectively use DiLogics to create automation programs that fulfill diverse input instructions. DiLogics provides an efficient, intuitive, and expressive method for developing web automation programs satisfying diverse specifications.
Improving Proactive Dialog Agents Using Socially-Aware Reinforcement Learning
Kraus, Matthias, Wagner, Nicolas, Riekenbrauck, Ron, Minker, Wolfgang
The next step for intelligent dialog agents is to escape their role as silent bystanders and become proactive. Well-defined proactive behavior may improve human-machine cooperation, as the agent takes a more active role during interaction and takes responsibility off the user. However, proactivity is a double-edged sword because poorly executed pre-emptive actions may have a devastating effect not only on the task outcome but also on the relationship with the user. For designing adequate proactive dialog strategies, we propose a novel approach that includes both social and task-relevant features in the dialog. Here, the primary goal is to optimize proactive behavior so that it is task-oriented, which implies high task success and efficiency, while also being socially effective by fostering user trust. Including both aspects in the reward function for training a proactive dialog agent using reinforcement learning showed the benefit of our approach for more successful human-machine cooperation.
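One way to picture a reward that mixes task-oriented and social terms is a weighted sum, sketched below. The function name, the three inputs, the weights, and the assumption that trust is a score in [0, 1] are all hypothetical choices for illustration; the abstract states only that both aspects enter the reward function, not how they are combined.

```python
def proactive_reward(
    task_success: bool,
    num_turns: int,
    trust: float,
    *,
    w_task: float = 1.0,
    w_efficiency: float = 0.05,
    w_trust: float = 0.5,
) -> float:
    """Illustrative episode reward for a proactive dialog agent.

    task_success: whether the dialog reached its goal (task term).
    num_turns:    dialog length; each turn costs a little (efficiency term).
    trust:        estimated user trust in [0, 1] (social term, assumed scale).
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must lie in [0, 1]")
    if num_turns < 0:
        raise ValueError("num_turns must be non-negative")
    task_term = w_task * float(task_success)
    efficiency_term = -w_efficiency * num_turns
    social_term = w_trust * trust
    return task_term + efficiency_term + social_term


# Two successful dialogs of equal length: the one that preserved more
# user trust receives the higher reward.
trusted = proactive_reward(True, num_turns=4, trust=0.8)
distrusted = proactive_reward(True, num_turns=4, trust=0.2)
```

With a purely task-oriented reward (`w_trust=0`), the two dialogs above would be indistinguishable, which is the gap the social term is meant to close.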
Development of a Trust-Aware User Simulator for Statistical Proactive Dialog Modeling in Human-AI Teams
Kraus, Matthias, Riekenbrauck, Ron, Minker, Wolfgang
A human-AI team (HAIT) requires close coordination between humans and AI teammates to work together towards a common goal [40]. Effective communication, prediction of teammates' actions, and high-level coordination are essential components of this collaborative effort. In this regard, the proactive behavior of AI-based systems and the communication thereof during collaboration is an important research topic concerning HAITs, e.g., see Horvitz et al. [8]. Proactivity can be defined as an AI's self-initiating, anticipatory behavior for contributing to effective and efficient task completion. It has been shown to be essential for human teamwork as it leads to higher job and team performance and is associated with leadership and innovation [3]. However, the design of adequate proactivity for AI-based systems to support humans is still an open question and a challenging topic. It is essential to study the impact of proactive system actions on the human-agent trust relationship and how to use information about an AI agent's perceived trustworthiness to model appropriate proactive dialog strategies for forming effective HAITs.
Abstract Demonstrations and Adaptive Exploration for Efficient and Stable Multi-step Sparse Reward Reinforcement Learning
Yang, Xintong, Ji, Ze, Wu, Jing, Lai, Yu-kun
Although Deep Reinforcement Learning (DRL) has been popular in many disciplines including robotics, state-of-the-art DRL algorithms still struggle to learn long-horizon, multi-step and sparse reward tasks, such as stacking several blocks given only a task-completion reward signal. To improve learning efficiency for such tasks, this paper proposes a DRL exploration technique, termed A^2, which integrates two components inspired by human experiences: Abstract demonstrations and Adaptive exploration. A^2 starts by decomposing a complex task into subtasks, and then provides the correct orders of subtasks to learn. During training, the agent explores the environment adaptively, acting more deterministically for well-mastered subtasks and more stochastically for ill-learnt subtasks. Ablation and comparative experiments are conducted on several grid-world tasks and three robotic manipulation tasks. We demonstrate that A^2 can aid popular DRL algorithms (DQN, DDPG, and SAC) to learn more efficiently and stably in these environments.
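The adaptive exploration idea, acting more deterministically on well-mastered subtasks and more stochastically on ill-learnt ones, can be sketched as a per-subtask exploration rate tied to a running success estimate. This is a minimal sketch under stated assumptions: the epsilon-greedy form, the exponential moving average, the class name `AdaptiveExplorer`, and all default values are illustrative and are not taken from the paper.

```python
import random


class AdaptiveExplorer:
    """Per-subtask epsilon-greedy exploration driven by success rates."""

    def __init__(self, subtasks, eps_min=0.05, eps_max=1.0, momentum=0.9):
        # The subtask order stands in for the abstract demonstration:
        # it tells the agent which subtask to attempt next.
        self.order = list(subtasks)
        self.eps_min = eps_min
        self.eps_max = eps_max
        self.momentum = momentum
        self.success_rate = {name: 0.0 for name in self.order}

    def update(self, subtask, succeeded):
        """Fold one attempt's outcome into the running success estimate."""
        old = self.success_rate[subtask]
        self.success_rate[subtask] = (
            self.momentum * old + (1.0 - self.momentum) * float(succeeded)
        )

    def epsilon(self, subtask):
        """High success gives low epsilon (near-deterministic behaviour)."""
        span = self.eps_max - self.eps_min
        return self.eps_max - span * self.success_rate[subtask]

    def act(self, subtask, greedy_action, action_space, rng=random):
        """Return a random action with probability epsilon, else the greedy one."""
        if rng.random() < self.epsilon(subtask):
            return rng.choice(action_space)
        return greedy_action


explorer = AdaptiveExplorer(["reach", "grasp", "stack"])
for _ in range(50):
    explorer.update("reach", succeeded=True)  # "reach" becomes well mastered

eps_reach = explorer.epsilon("reach")  # close to eps_min
eps_grasp = explorer.epsilon("grasp")  # still eps_max, never succeeded
action = explorer.act("reach", greedy_action=0, action_space=[0, 1, 2, 3])
```

Keeping a separate rate per subtask means exploration effort concentrates on the step that is currently blocking progress, instead of being spread evenly over steps the agent already performs reliably.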