Goto

Collaborating Authors: Dulac-Arnold, Gabriel


RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

arXiv.org Artificial Intelligence

We present a scalable, bottom-up and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons, and that has 2.2x higher throughput compared to traditional narrow top-down step-by-step collection. We collect realistic data by performing arbitrary user requests across the entirety of three office buildings, using multiple robot and human embodiments. With this data, we show that models trained on all embodiments perform better than ones trained on the robot data only, even when evaluated solely on robot episodes. We find that for a fixed collection budget it is beneficial to take advantage of cheaper human collection alongside robot collection. We release a large and highly diverse (29,520 unique instructions) dataset dubbed RoboVQA, containing 829,502 (video, text) pairs for robotics-focused visual question answering. We also demonstrate how evaluating real robot experiments with an intervention mechanism enables performing tasks to completion, making the system deployable with human oversight even when imperfect, while also providing a single performance metric. We demonstrate a single video-conditioned model named RoboVQA-VideoCoCa, trained on our dataset, that is capable of performing a variety of grounded high-level reasoning tasks in broad realistic settings, with a cognitive intervention rate 46% lower than a zero-shot state-of-the-art visual language model (VLM) baseline, and that is able to guide real robots through long-horizon tasks. The performance gap with zero-shot state-of-the-art models indicates that a large amount of grounded data remains to be collected for real-world deployment, emphasizing the critical need for scalable data collection approaches. Finally, we show that video VLMs significantly outperform single-image VLMs, with an average error rate reduction of 19% across all VQA tasks. Data and videos are available at https://robovqa.github.io
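
To make the intervention-based evaluation concrete, here is a minimal sketch of how a cognitive intervention rate could be computed over evaluation episodes. The episode format and field names are hypothetical illustrations, not the RoboVQA codebase.

```python
# Hypothetical sketch: computing a "cognitive intervention rate" over
# evaluation episodes. The episode/step structure is an assumption.

def intervention_rate(episodes):
    """Fraction of high-level reasoning steps where a human had to
    override the model's output for the task to reach completion."""
    interventions, total_steps = 0, 0
    for ep in episodes:
        for step in ep["steps"]:
            total_steps += 1
            if step["human_override"]:
                interventions += 1
    return interventions / max(total_steps, 1)

# A 46% relative reduction vs. a zero-shot baseline would mean
# rate_model == 0.54 * rate_baseline.
episodes = [{"steps": [{"human_override": False},
                       {"human_override": True},
                       {"human_override": False}]}]
print(intervention_rate(episodes))  # 1/3 of steps needed oversight
```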


Barkour: Benchmarking Animal-level Agility with Quadruped Robots

arXiv.org Artificial Intelligence

Abstract--Animals have evolved various agile locomotion strategies, such as sprinting, leaping, and jumping. There is a growing interest in developing legged robots that move like their biological counterparts and show various agile skills to navigate complex environments quickly. Despite the interest, the field lacks systematic benchmarks to measure the performance of control policies and hardware in agility. We introduce the Barkour benchmark, an obstacle course to quantify agility for legged robots. Inspired by dog agility competitions, it consists of diverse obstacles and a time-based scoring mechanism. This encourages researchers to develop controllers that not only move fast, but do so in a controllable and versatile way. To set strong baselines, we present two methods for tackling the benchmark. In the first approach, we train specialist locomotion skills using on-policy reinforcement learning methods and combine them with a high-level navigation controller. In the second approach, we distill the specialist skills into a Transformer-based generalist locomotion policy, named Locomotion-Transformer, that can handle various terrains and adjust the robot's gait based on the perceived environment.

There has been a proliferation of legged robot development inspired by animal mobility. An important research question in this field is how to develop a controller that enables legged robots to exhibit animal-level agility while also being able to generalize across various obstacles and terrains. Through the exploration of both learning and traditional control-based methods, there has been significant progress in enabling robots to walk across a wide range of terrains [10, 21, 20, 1, 27]. These robots are now capable of walking in a variety of indoor and outdoor environments, such as up and down stairs, through bushes, and over unpaved roads and rocky or even sandy beaches. Despite advances in robot hardware and control, a major challenge in the field is the lack of standardized and intuitive methods for evaluating the effectiveness of locomotion controllers.
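
To make the time-based scoring idea concrete, here is an illustrative agility score that rewards fast, clean runs. The penalty constants and target time are invented for this sketch; they are not the official Barkour rules.

```python
# Illustrative time-based agility score in the spirit of the benchmark:
# finish all obstacles quickly and cleanly to score close to 1.0.
# All constants below are assumptions, not the official scoring.

def agility_score(finish_time_s, failed_obstacles,
                  target_time_s=10.0, obstacle_penalty=0.1,
                  time_penalty_per_s=0.01):
    score = 1.0
    score -= obstacle_penalty * failed_obstacles          # clean execution
    score -= time_penalty_per_s * max(0.0, finish_time_s - target_time_s)
    return max(0.0, score)

print(agility_score(finish_time_s=12.5, failed_obstacles=1))  # 0.875
```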


Get Back Here: Robust Imitation by Return-to-Distribution Planning

arXiv.org Artificial Intelligence

Imitation Learning (IL) is a paradigm in sequential decision making where an agent uses offline expert trajectories to mimic the expert's behavior [1]. While Reinforcement Learning (RL) requires an additional reward signal that can be hard to specify in practice, IL only requires expert trajectories, which can be easier to collect. In part due to its simplicity, IL has been applied successfully in several real-world tasks, from robotic manipulation [2, 3, 4] to autonomous driving [5, 6]. A key challenge in deploying IL, however, is that the agent may encounter states in the final deployment environment that were not labeled by the expert offline [7]. In applications such as healthcare [8, 9] and robotics [10, 11], online experimentation can be risky (e.g., on human patients) or costly to label (e.g., off-policy robotic datasets can take months to collect).
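
For readers new to IL, a minimal behavior-cloning sketch shows what "mimicking the expert from offline trajectories" means in the simplest case. The linear expert and least-squares fit are toy assumptions, not the paper's method, which goes further with return-to-distribution planning.

```python
# Minimal behavior-cloning sketch: fit actions to expert states by
# least squares. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 4))            # expert-visited states
expert_w = rng.normal(size=(4, 2))
actions = states @ expert_w                    # expert actions (linear expert)

w_bc, *_ = np.linalg.lstsq(states, actions, rcond=None)  # fit pi(s) ~= a
print(np.abs(w_bc - expert_w).max())           # recovers the expert policy

# The covariate-shift failure mode: pi was only fit on `states`; on
# out-of-distribution states its errors compound over a rollout.
```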


Learning Reward Functions for Robotic Manipulation by Observing Humans

arXiv.org Artificial Intelligence

Observing a human demonstrator manipulate objects provides a rich, scalable and inexpensive source of data for learning robotic policies. However, transferring skills from human videos to a robotic manipulator poses several challenges, not least a difference in action and observation spaces. In this work, we use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies. Thanks to the diversity of this training data, the learned reward function generalizes sufficiently well to image observations from a previously unseen robot embodiment and environment to provide a meaningful prior for directed exploration in reinforcement learning. We propose two methods for scoring states relative to a goal image: through direct temporal regression, and through distances in an embedding space obtained with time-contrastive learning. By conditioning the function on a goal image, we are able to reuse one model across a variety of tasks. Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD), requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies, yet it is able to accelerate the training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.
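
A minimal sketch of the embedding-distance variant: score a state by its negative distance to the goal image in a learned embedding space. The random-projection `embed` below is a stand-in for a time-contrastively trained encoder, included only so the snippet runs.

```python
# Goal-image-conditioned reward as negative embedding distance.
# `embed` stands in for a learned encoder; here it is a fixed random
# projection purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
proj = rng.normal(size=(64 * 64 * 3, 32))

def embed(image):
    return image.reshape(-1) @ proj

def reward(obs_image, goal_image):
    # Closer to the goal in embedding space -> higher (less negative) reward.
    return -np.linalg.norm(embed(obs_image) - embed(goal_image))

obs = rng.random((64, 64, 3))
goal = rng.random((64, 64, 3))
print(reward(obs, goal))   # dense shaping signal for RL exploration
```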


On the importance of data collection for training general goal-reaching policies

arXiv.org Artificial Intelligence

Recent advances in ML suggest that the quantity of data available to a model is one of the primary bottlenecks to high performance. Although for language-based tasks there exist almost unlimited amounts of reasonably coherent data to train from, this is generally not the case for Reinforcement Learning, especially when dealing with a novel environment. In effect, even a relatively trivial continuous environment has an almost limitless number of states, but simply sampling random states and actions will likely not provide transitions that are interesting or useful for any potential downstream task. How should one generate massive amounts of useful data given only an MDP with no indication of downstream tasks? Are the quantity and quality of data truly transformative to the performance of a general controller? We propose to answer both of these questions. First, we introduce a principled unsupervised exploration method, ChronoGEM, which aims to achieve uniform coverage over the manifold of achievable states, which we believe is the most reasonable goal given no prior task information. Second, we investigate the effects of both data quantity and data quality on the training of a downstream goal-achievement policy, and show that both large quantity and high quality of data are essential to train a general controller: a high-precision pose-achievement policy capable of attaining a large number of poses over numerous continuous control embodiments, including a humanoid.
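
One way to picture coverage-seeking exploration of this kind is a particle scheme that branches random actions and then resamples states inversely to a local density estimate, pushing the population toward uniform coverage of the reachable set. The toy 2D dynamics and constants below are assumptions for illustration, not ChronoGEM itself.

```python
# Toy density-reweighted exploration sketch: branch, estimate density
# with a kNN distance, resample sparser regions more often.
import numpy as np

rng = np.random.default_rng(0)
states = np.zeros((256, 2))                     # all particles start at origin

for _ in range(100):
    # Branch: each state takes a random bounded step (toy dynamics).
    nxt = states + rng.uniform(-0.1, 0.1, size=states.shape)
    nxt = np.clip(nxt, -1.0, 1.0)
    # Local density estimate: distance to the k-th nearest neighbor.
    d = np.linalg.norm(nxt[:, None] - nxt[None, :], axis=-1)
    kth = np.sort(d, axis=1)[:, 10]             # k = 10
    weights = kth / kth.sum()                   # sparser regions weigh more
    idx = rng.choice(len(nxt), size=len(nxt), p=weights)
    states = nxt[idx]

print(states.min(axis=0), states.max(axis=0))   # spread over the square
```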


Investigating the role of model-based learning in exploration and transfer

arXiv.org Artificial Intelligence

State-of-the-art reinforcement learning has enabled training agents on tasks of ever-increasing complexity. However, the current paradigm tends to favor training agents from scratch on every new task, or on collections of tasks with a view towards generalizing to novel task configurations. The former suffers from poor data efficiency, while the latter is difficult when test tasks are out-of-distribution. Agents that can effectively transfer their knowledge about the world offer a potential solution to these issues. In this paper, we investigate transfer learning in the context of model-based agents. Specifically, we aim to understand when exactly environment models have an advantage and why. We find that a model-based approach outperforms controlled model-free baselines for transfer learning. Through ablations, we show that both the policy and the dynamics model learnt through exploration matter for successful transfer. We demonstrate our results across three domains which vary in their requirements for transfer: in-distribution procedural (Crafter), in-distribution identical (RoboDesk), and out-of-distribution (Meta-World). Our results show that intrinsic exploration combined with environment models presents a viable direction towards agents that are self-supervised and able to generalize to novel reward functions.
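
A standard way to couple a learned dynamics model with intrinsic exploration is to use the model's prediction error as a curiosity bonus: the agent is rewarded for visiting transitions the model cannot yet predict. The linear model, update rule, and dynamics below are illustrative stand-ins, not the paper's architecture.

```python
# Curiosity-as-prediction-error sketch with a toy linear world model.
import numpy as np

rng = np.random.default_rng(0)
W_true = rng.normal(size=(4, 4)) * 0.1          # unknown true dynamics
W_model = np.zeros((4, 4))                      # learned dynamics model
lr = 0.5

s = rng.normal(size=4)
for t in range(2000):
    s_next = s @ W_true + rng.normal(size=4) * 0.01
    err = s_next - s @ W_model
    intrinsic_reward = (err ** 2).sum()         # high where the model is wrong
    W_model += lr * np.outer(s, err) / (s @ s + 1e-8)  # online regression step
    s = s_next

print(np.abs(W_model - W_true).max())  # error shrinks as the model fits,
                                       # so the bonus decays toward noise
```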


A Geometric Perspective on Self-Supervised Policy Adaptation

arXiv.org Artificial Intelligence

One of the most challenging aspects of real-world reinforcement learning (RL) is the multitude of unpredictable and ever-changing distractions that can divert an agent from what it was tasked to do in its training environment. While an agent could learn from reward signals to ignore them, the complexity of the real world can make rewards hard to acquire or, at best, extremely sparse. A recent class of self-supervised methods has shown promise that reward-free adaptation under challenging distractions is possible. However, previous work focused on a short, one-episode adaptation setting. In this paper, we consider a long-term adaptation setup that is more akin to the realities of the real world and propose a geometric perspective on self-supervised adaptation. We empirically describe the processes that take place in the embedding space during adaptation, reveal some of their undesirable effects on performance, and show how these can be eliminated. Moreover, we theoretically study how actor-based and actor-free agents can further generalise to the target environment by manipulating the geometry of the manifolds described by the actor and critic functions.
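
A rough sketch of what reward-free adaptation can look like mechanically: freeze the task head and update only the encoder with a self-supervised inverse-dynamics loss on deployment transitions, so the embedding shifts to the target environment without any reward. The linear encoder/head and this particular auxiliary loss are assumptions for illustration, not the paper's exact setup.

```python
# Reward-free adaptation sketch: encoder E is updated online via an
# inverse-dynamics auxiliary loss; the head H stays frozen.
import numpy as np

rng = np.random.default_rng(0)
obs_dim, z_dim, act_dim = 8, 4, 2
E = rng.normal(size=(obs_dim, z_dim)) * 0.1       # encoder (adapted online)
H = rng.normal(size=(2 * z_dim, act_dim)) * 0.1   # frozen inverse-dynamics head

def adapt_step(o_t, o_t1, a_t, lr=0.01):
    global E
    z = np.concatenate([o_t @ E, o_t1 @ E])
    err = z @ H - a_t                             # self-supervised error, no reward
    g = err @ H.T                                 # gradient of ||z@H - a||^2 w.r.t. z
    grad_E = np.outer(o_t, g[:z_dim]) + np.outer(o_t1, g[z_dim:])
    E -= lr * grad_E

o_t, o_t1 = rng.normal(size=obs_dim), rng.normal(size=obs_dim)
a_t = rng.normal(size=act_dim)
adapt_step(o_t, o_t1, a_t)
```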


Model-Based Offline Planning

arXiv.org Artificial Intelligence

Offline learning is a key part of making reinforcement learning (RL) usable in real systems. Offline RL looks at scenarios where there is data from a system's operation, but no direct access to the system when learning a policy. Recent work on training RL policies from offline data has shown results both with model-free policies learned directly from the data and with planning on top of learnt models of the data. Model-free policies tend to be more performant, but are more opaque, harder to command externally, and less easy to integrate into larger systems. We propose an offline learner that generates a model that can be used to control the system directly through planning. This allows us to have easily controllable policies directly from data, without ever interacting with the system. We show the performance of our algorithm, Model-Based Offline Planning (MBOP), on a series of robotics-inspired tasks, and demonstrate its ability to leverage planning to respect environmental constraints. We are able to find near-optimal policies for certain simulated systems from as little as 50 seconds of real-time system interaction, and to create zero-shot goal-conditioned policies on a series of environments.
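
The core loop of "planning on top of a learnt model" can be sketched as random-shooting model-predictive control: roll candidate action sequences through the learned dynamics and execute the first action of the best one. The toy dynamics and reward below stand in for learned models, and this omits MBOP's ensembles and behavior-cloned action prior.

```python
# Minimal random-shooting MPC sketch over a (stand-in) learned model.
import numpy as np

rng = np.random.default_rng(0)

def model_step(s, a):           # stand-in for a learned dynamics model
    return 0.9 * s + 0.1 * a

def reward(s):                  # stand-in for a learned reward model
    return -np.sum(s ** 2)      # drive the state to the origin

def plan(s0, horizon=10, n_candidates=256):
    best_ret, best_a0 = -np.inf, None
    for _ in range(n_candidates):
        seq = rng.uniform(-1, 1, size=(horizon, s0.shape[0]))
        s, ret = s0, 0.0
        for a in seq:
            s = model_step(s, a)
            ret += reward(s)
        if ret > best_ret:
            best_ret, best_a0 = ret, seq[0]
    return best_a0              # MPC: execute first action, then replan

print(plan(np.array([1.0, -1.0])))
```

Because control comes from the planner rather than a fixed policy network, swapping the reward function or adding constraints at deployment time changes behavior without retraining, which is the controllability argument made above.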


RL Unplugged: Benchmarks for Offline Reinforcement Learning

arXiv.org Machine Learning

Offline methods for reinforcement learning have the potential to help bridge the gap between reinforcement learning research and real-world applications. They make it possible to learn policies from offline datasets, thus overcoming concerns associated with online data collection in the real world, including cost, safety, and ethical concerns. In this paper, we propose a benchmark called RL Unplugged to evaluate and compare offline RL methods. RL Unplugged includes data from a diverse range of domains, including games (e.g., the Atari benchmark) and simulated motor control problems (e.g., the DM Control Suite). The datasets include domains that are partially or fully observable, use continuous or discrete actions, and have stochastic or deterministic dynamics. We propose detailed evaluation protocols for each domain in RL Unplugged and provide an extensive analysis of supervised learning and offline RL methods using these protocols. We will release data for all our tasks and open-source all algorithms presented in this paper. We hope that our suite of benchmarks will increase the reproducibility of experiments and make it possible to study challenging tasks with a limited computational budget, thus making RL research both more systematic and more accessible across the community. Moving forward, we view RL Unplugged as a living benchmark suite that will evolve and grow with datasets contributed by the research community and ourselves. Our project page is available at https://git.io/JJUhd.
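
The evaluation pattern such benchmarks enforce is: fit a policy purely from logged data, then touch the environment only to measure performance. The sketch below uses a toy contextual task and a crude per-action regression as the "offline method"; the dataset format is a stand-in, not the RL Unplugged API.

```python
# Generic offline-training / online-evaluation protocol sketch.
import numpy as np

rng = np.random.default_rng(0)
# Logged dataset: (state, action, reward) triples from a behavior policy.
S = rng.normal(size=(5000, 3))
A = rng.integers(0, 2, size=5000)
R = (S[:, 0] > 0).astype(float) * (A == 1) + (S[:, 0] <= 0) * (A == 0)

# "Training" offline: per-action reward regression (a crude Q estimate).
q_w = [np.linalg.lstsq(S[A == a], R[A == a], rcond=None)[0] for a in (0, 1)]
policy = lambda s: int((s @ q_w[1]) > (s @ q_w[0]))   # greedy w.r.t. Q-hat

# Online evaluation, kept strictly separate from the training data.
S_eval = rng.normal(size=(1000, 3))
success = np.mean([(s[0] > 0) == policy(s) for s in S_eval])
print("eval success rate:", success)
```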


Differentiable Deep Clustering with Cluster Size Constraints

arXiv.org Machine Learning

Clustering is a fundamental unsupervised learning approach. Many clustering algorithms -- such as $k$-means -- rely on the Euclidean distance as a similarity measure, which is often not the most relevant metric for high-dimensional data such as images. Learning a lower-dimensional embedding that better reflects the geometry of the dataset is therefore instrumental for performance. We propose a new approach for this task where the embedding is performed by a differentiable model such as a deep neural network. By rewriting the $k$-means clustering algorithm as an optimal transport task and adding an entropic regularization, we derive a fully differentiable loss function that can be minimized with respect to both the embedding parameters and the cluster parameters via stochastic gradient descent. We show that this new formulation generalizes a recently proposed state-of-the-art method based on soft-$k$-means by adding constraints on the cluster sizes. Empirical evaluations on image classification benchmarks suggest that, compared to state-of-the-art methods, our optimal-transport-based approach provides better unsupervised accuracy and does not require a pre-training phase.
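
A small numerical sketch of the optimal-transport view: Sinkhorn iterations produce a soft, differentiable assignment between embedded points and centroids, with the column marginal of the transport plan enforcing the cluster-size constraint. This is a NumPy toy with made-up sizes and regularization strength; in the paper's setting the resulting loss would be backpropagated through an embedding network with an autodiff framework.

```python
# Entropic-OT view of k-means: Sinkhorn yields soft assignments whose
# column sums match a prescribed cluster-size distribution.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # embedded points (toy)
C = rng.normal(size=(4, 2))                    # cluster centroids
eps = 0.5                                      # entropic regularization

cost = ((X[:, None] - C[None, :]) ** 2).sum(-1)   # squared Euclidean costs
K = np.exp(-cost / eps)
r = np.full(100, 1 / 100)                      # uniform mass over points
c = np.full(4, 1 / 4)                          # equal-size cluster constraint

u = np.ones(100)
for _ in range(200):                           # Sinkhorn fixed-point iterations
    v = c / (K.T @ u)
    u = r / (K @ v)

P = u[:, None] * K * v[None, :]                # soft assignment plan
loss = (P * cost).sum()                        # differentiable k-means loss
print(loss, P.sum(axis=0))                     # column sums ~= c
```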