basic skill
HEMM: Holistic Evaluation of Multimodal Foundation Models
Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across a set of 3 dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and human-computer interaction applications. Through comprehensive experiments across the 30 tasks in HEMM, we (1) identify key dataset dimensions (e.g., basic skills, information flows, and use cases) that pose challenges to today's models, and (2) distill performance trends regarding how different modeling dimensions (e.g., scale, pre-training data, multimodal alignment, pre-training, and instruction tuning objectives) influence performance.
Learning Generalizable Language-Conditioned Cloth Manipulation from Long Demonstrations
Zhao, Hanyi, Zhu, Jinxuan, Yan, Zihao, Li, Yichen, Deng, Yuhong, Wang, Xueqian
Multi-step cloth manipulation is a challenging problem for robots due to the high-dimensional state spaces and the dynamics of cloth. Despite recent significant advances in end-to-end imitation learning for multi-step cloth manipulation skills, these methods fail to generalize to unseen tasks. Our insight in tackling the challenge of generalizable multi-step cloth manipulation is decomposition. We propose a novel pipeline that autonomously learns basic skills from long demonstrations and composes learned basic skills to generalize to unseen tasks. Specifically, our method first discovers and learns basic skills from the existing long demonstration benchmark with the commonsense knowledge of a large language model (LLM). Then, leveraging a high-level LLM-based task planner, these basic skills can be composed to complete unseen tasks. Experimental results demonstrate that our method outperforms baseline methods in learning multi-step cloth manipulation skills for both seen and unseen tasks.
PROSKILL: A formal skill language for acting in robotics
Acting is an important decisional function to ensure proper deliberation on an autonomous system (Ingrand and Ghallab, 2017). It often sits between planning and the platform, but unlike planning it is an online process, which must stay reactive to the dynamics of the environment and the platform and cannot devote resources to long computations and complex searches. Acting often relies on models, called skills, which specify how to perform actions (as an operational model), while the action models used for planning describe more abstractly what is needed to perform the action (as a descriptive model) (Ghallab et al., 2016). The most basic skills need to connect to the commands made available by the functional level to the acting component, call them asynchronously, and get execution status and results, but they also need means to receive exogenous events as they occur in the environment. This action/command dispatching may also rely on preconditions and invariants checking, interruptions, temporal constraints, etc. Above the basic skills one often finds more complex skills, similar to programs with control structures to allow for local choices and local recoveries with test, branching, looping, parallel and asynchronous execution. Considering the expected functionalities of an acting component, its skill language/framework should provide the following features: Support for Validation and Verification (V&V). Notwithstanding the other functionalities, this is the feature the work presented in this paper focuses on. One cannot rely only on basic skills connecting to the robot commands; one also needs some programming primitives (e.g., test, branching, loop).
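As a rough illustration of the precondition and invariant checking described above, the following Python sketch wraps a command in a basic skill. It is not PROSKILL syntax: the `BasicSkill` class, the `drive` command, and the battery thresholds are all hypothetical, and the command runs synchronously here for brevity, whereas the text describes asynchronous calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator

State = Dict[str, float]


@dataclass
class BasicSkill:
    """Hypothetical basic skill: a command guarded by a precondition and an invariant."""
    name: str
    precondition: Callable[[State], bool]
    invariant: Callable[[State], bool]
    command: Callable[[State], Iterator[State]]  # yields a state after each step

    def run(self, state: State) -> str:
        # Refuse to dispatch the command if the precondition does not hold.
        if not self.precondition(state):
            return "precondition_failed"
        # Check the invariant after every step and interrupt on violation.
        for new_state in self.command(state):
            if not self.invariant(new_state):
                return "interrupted"
        return "success"


def drive(state: State) -> Iterator[State]:
    """Hypothetical command: three steps, each costing 10 battery units."""
    current = dict(state)
    for _ in range(3):
        current["battery"] -= 10
        current["x"] += 1
        yield dict(current)


goto = BasicSkill(
    name="goto",
    precondition=lambda s: s["battery"] > 20,
    invariant=lambda s: s["battery"] >= 5,
    command=drive,
)

print(goto.run({"battery": 50, "x": 0}))
print(goto.run({"battery": 20, "x": 0}))
print(goto.run({"battery": 25, "x": 0}))
```

The three calls exercise the three possible outcomes: normal completion, a rejected precondition, and an interruption when the invariant is violated mid-execution.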
We Choose to Go to Space: Agent-driven Human and Multi-Robot Collaboration in Microgravity
Xin, Miao, You, Zhongrui, Zhang, Zihan, Jiang, Taoran, Xu, Tingjia, Liang, Haotian, Ge, Guojing, Ji, Yuchen, Mo, Shentong, Cheng, Jian
We present SpaceAgents-1, a system for learning human and multi-robot collaboration (HMRC) strategies under microgravity conditions. Future space exploration requires humans to work together with robots. However, acquiring proficient robot skills and adept collaboration under microgravity conditions poses significant challenges within ground laboratories. To address this issue, we develop a microgravity simulation environment and present three typical configurations of intra-cabin robots. We propose a hierarchical heterogeneous multi-agent collaboration architecture: guided by foundation models, a Decision-Making Agent serves as a task planner for human-robot collaboration, while individual Skill-Expert Agents manage the embodied control of robots. This mechanism empowers the SpaceAgents-1 system to execute a range of intricate long-horizon HMRC tasks.
Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks
Yuan, Haoqi, Zhang, Chi, Wang, Hongcheng, Xie, Feiyang, Cai, Penglin, Dong, Hao, Lu, Zongqing
We study building multi-task agents in open-world environments. Without human demonstrations, learning to accomplish long-horizon tasks in a large open-world environment with reinforcement learning (RL) is extremely inefficient. To tackle this challenge, we convert the multi-task learning problem into learning basic skills and planning over the skills. Using the popular open-world game Minecraft as the testbed, we propose three types of fine-grained basic skills, and use RL with intrinsic rewards to acquire skills. A novel Finding-skill that performs exploration to find diverse items provides better initialization for other skills, improving the sample efficiency for skill learning. In skill planning, we leverage the prior knowledge in Large Language Models to find the relationships between skills and build a skill graph. When the agent is solving a task, our skill search algorithm walks on the skill graph and generates the proper skill plans for the agent. In experiments, our method accomplishes 40 diverse Minecraft tasks, where many tasks require sequentially executing more than 10 skills. Our method outperforms baselines by a large margin and is the most sample-efficient demonstration-free RL method to solve Minecraft Tech Tree tasks. The project's website and code can be found at https://sites.google.com/view/plan4mc.
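The idea of walking a skill graph to produce a skill plan can be sketched as follows. The abstract does not specify the search algorithm, so this is only a plain depth-first traversal over prerequisite edges; the graph contents, the skill names, and the `plan_skills` function are hypothetical and are not taken from the project's code.

```python
from typing import Dict, List, Set

# Hypothetical skill graph: each skill maps to the skills that must run before it.
SKILL_GRAPH: Dict[str, List[str]] = {
    "find_tree": [],
    "harvest_log": ["find_tree"],
    "craft_planks": ["harvest_log"],
    "craft_stick": ["craft_planks"],
    "craft_table": ["craft_planks"],
    "craft_wooden_pickaxe": ["craft_stick", "craft_table"],
}


def plan_skills(target: str, graph: Dict[str, List[str]]) -> List[str]:
    """Return a skill sequence ending in `target`, with prerequisites first."""
    plan: List[str] = []
    in_progress: Set[str] = set()

    def visit(skill: str) -> None:
        if skill in plan:
            return  # already scheduled
        if skill in in_progress:
            raise ValueError(f"cyclic dependency at skill {skill!r}")
        in_progress.add(skill)
        for prerequisite in graph[skill]:
            visit(prerequisite)
        in_progress.discard(skill)
        plan.append(skill)

    visit(target)
    return plan


print(plan_skills("craft_wooden_pickaxe", SKILL_GRAPH))
```

This sketch schedules each skill once; a planner for a real crafting task would presumably also need to track item quantities, which is omitted here.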
Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models
Chen, Jiaao, Pan, Xiaoman, Yu, Dian, Song, Kaiqiang, Wang, Xiaoyang, Yu, Dong, Chen, Jianshu
We consider the problem of eliciting compositional generalization capabilities in large language models (LLMs) with a novel type of prompting strategy. Compositional generalization empowers the LLMs to solve problems that are harder than the ones they have seen (i.e., easy-to-hard generalization), which is a critical reasoning capability of human-like intelligence. However, even the current state-of-the-art LLMs still struggle with this form of reasoning. To bridge this gap, we propose skills-in-context (SKiC) prompting, which instructs LLMs how to compose basic skills to resolve more complex problems. We find that it is crucial to demonstrate both the skills and the compositional examples within the same prompting context. With as few as two exemplars, our SKiC prompting initiates strong synergies between skills and their composition capabilities. Notably, it empowers LLMs to solve unseen problems that require innovative skill compositions, achieving near-perfect generalization on a broad range of challenging compositionality tasks. Intriguingly, SKiC prompting unlocks the latent potential of LLMs, enabling them to leverage pre-existing internal skills acquired during earlier pre-training stages, even when these skills are not explicitly presented in the prompting context. This results in the capability of LLMs to solve unseen complex problems by activating and composing internal competencies. With such prominent features, SKiC prompting is able to achieve state-of-the-art performance on challenging mathematical reasoning benchmarks (e.g., MATH).
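A prompt that places both the skills and the compositional exemplars in the same context might be assembled along these lines. This is a minimal sketch under assumptions: the section labels, the `build_skic_prompt` function, and the toy skills are hypothetical, and the actual prompt format used by the authors is not given in the abstract.

```python
from typing import List, Tuple


def build_skic_prompt(
    skills: List[Tuple[str, str]],
    exemplars: List[Tuple[str, str]],
    question: str,
) -> str:
    """Assemble one prompt: skill descriptions, then composed exemplars, then the query."""
    parts: List[str] = []
    # 1) Describe each basic skill.
    for index, (name, description) in enumerate(skills, start=1):
        parts.append(f"Skill {index} ({name}): {description}")
    # 2) Show exemplars whose answers explicitly compose those skills.
    for index, (example_question, example_answer) in enumerate(exemplars, start=1):
        parts.append(
            f"Example {index}\nQuestion: {example_question}\nAnswer: {example_answer}"
        )
    # 3) Pose the new question in the same context.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


prompt = build_skic_prompt(
    skills=[
        ("last_letter", "Return the last letter of a word."),
        ("concatenate", "Join letters in order into one string."),
    ],
    exemplars=[
        (
            "Concatenate the last letters of 'cat dog'.",
            "Using last_letter: 't', 'g'. Using concatenate: 'tg'.",
        ),
    ],
    question="Concatenate the last letters of 'red blue green'.",
)
print(prompt)
```

The resulting string would then be sent to an LLM of choice; no model call is shown because the abstract does not name a specific API.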
Multi-task Hierarchical Adversarial Inverse Reinforcement Learning
Chen, Jiayu, Tamboli, Dipesh, Lan, Tian, Aggarwal, Vaneet
Multi-task Imitation Learning (MIL) aims to train a policy capable of performing a distribution of tasks based on multi-task expert demonstrations, which is essential for general-purpose robots. Existing MIL algorithms suffer from low data efficiency and poor performance on complex long-horizon tasks. We develop Multi-task Hierarchical Adversarial Inverse Reinforcement Learning (MH-AIRL) to learn hierarchically-structured multi-task policies, which is more beneficial for compositional tasks with long horizons and has higher expert data efficiency through identifying and transferring reusable basic skills across tasks. To realize this, MH-AIRL effectively synthesizes context-based multi-task learning, AIRL (an IL approach), and hierarchical policy learning. Further, MH-AIRL can be applied to demonstrations without the task or skill annotations (i.e., state-action pairs only), which are more accessible in practice. Theoretical justifications are provided for each module of MH-AIRL, and evaluations on challenging multi-task settings demonstrate superior performance and transferability of the multi-task policies learned with MH-AIRL as compared to SOTA MIL baselines.
Horn
We regularly encounter complex activities consisting of basic skills, both conscious and subconscious. Adequately performing these complex activities involves mastering the individual basic skills and having the ability to seamlessly integrate them. Games are one such example of a complex activity that is difficult to break down into the basic skills required, but engagement in games relies on designers introducing challenges proportionate to a player's skill. Procedurally generated levels cause additional problems since it is hard to estimate level difficulty for a particular player. This proposal suggests a framework for determining the skills necessary to successfully complete a game, creating AI-based bots with those skills to reflect players with the same skills, and identifying and generating optimal orderings of levels to promote learning each skill of a game.
Automatic Extension of a Symbolic Mobile Manipulation Skill Set
Förster, Julian, Ott, Lionel, Nieto, Juan, Siegwart, Roland, Chung, Jen Jen
Symbolic planning can provide an intuitive interface for non-expert users to operate autonomous robots by abstracting away much of the low-level programming. However, symbolic planners assume that the initially provided abstract domain and problem descriptions are closed and complete. This means that they are fundamentally unable to adapt to changes in the environment or task that are not captured by the initial description. We propose a method that allows an agent to automatically extend its skill set, and thus the abstract description, upon encountering such a situation. We introduce strategies for generalizing from previous experience, completing sequences of key actions and discovering preconditions to ensure the efficiency of our skill sequence exploration scheme. The resulting system is evaluated in simulation on object rearrangement tasks. Compared to a Monte Carlo Tree Search baseline, our strategies for efficient search have on average a 29% higher success rate at a 68% faster runtime.