Learning a Diffusion Model Policy from Rewards via Q-Score Matching
Psenka, Michael, Escontrela, Alejandro, Abbeel, Pieter, Ma, Yi
Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning, owing to their natural ability to optimize an expressive class of distributions over a continuous space. However, previous works fail to exploit the score-based structure of diffusion models and instead rely on a simple behavior cloning term to train the actor, limiting their applicability in the actor-critic setting. In this paper, we focus on off-policy reinforcement learning and propose a new method for learning a diffusion model policy that exploits the link between the score of the policy and the action gradient of the Q-function. We call this method Q-score matching and provide theoretical justification for the approach. We conduct experiments in simulated environments to demonstrate the effectiveness of our proposed method and compare it to popular baselines.
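The core idea described above — training the policy's score to align with the critic's action gradient — can be illustrated with a toy sketch. This is not the paper's implementation: the network shapes and names are invented, and a single score network stands in for the full diffusion denoiser evaluated across noise levels.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 3, 2

# Hypothetical stand-ins: a score network for the policy and a critic Q(s, a).
score_net = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                          nn.Linear(32, action_dim))
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                      nn.Linear(32, 1))

def qsm_loss(states, actions):
    """Match the policy score to the critic's action gradient (simplified)."""
    actions = actions.detach().requires_grad_(True)
    q = q_net(torch.cat([states, actions], dim=-1)).sum()
    # dQ/da: the direction in action space that increases the critic's value.
    grad_q = torch.autograd.grad(q, actions)[0]
    pred_score = score_net(torch.cat([states, actions], dim=-1))
    # Regress the predicted score onto the (detached) critic gradient.
    return ((pred_score - grad_q.detach()) ** 2).mean()

states = torch.randn(8, state_dim)
actions = torch.randn(8, action_dim)
loss = qsm_loss(states, actions)
```

In practice the gradient target would be combined with the diffusion model's noise schedule, but the sketch captures the linked structure the abstract refers to: the critic's action gradient serves as the regression target for the policy's score.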
Video Prediction Models as Rewards for Reinforcement Learning
Escontrela, Alejandro, Adeniji, Ademi, Yan, Wilson, Jain, Ajay, Peng, Xue Bin, Goldberg, Ken, Lee, Youngwoon, Hafner, Danijar, Abbeel, Pieter
Specifying reward signals that allow agents to learn complex behaviors is a long-standing challenge in reinforcement learning. A promising approach is to extract preferences for behaviors from unlabeled videos, which are widely available on the internet. We present Video Prediction Rewards (VIPER), an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning. Specifically, we first train an autoregressive transformer on expert videos and then use the video prediction likelihoods as reward signals for a reinforcement learning agent. VIPER enables expert-level control without programmatic task rewards across a wide range of DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction model allows us to derive rewards for an out-of-distribution environment where no expert data is available, enabling cross-embodiment generalization for tabletop manipulation. We see our work as a starting point for scalable reward specification from unlabeled videos that will benefit from the rapid advances in generative modeling. Source code and datasets are available on the project website: https://escontrela.me/viper
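The reward mechanism the abstract describes — scoring an agent's observed frames by the log-likelihood an autoregressive video model assigns to them — can be sketched as follows. This is an illustrative toy, not VIPER itself: a small GRU over discrete frame tokens stands in for the pretrained transformer, and all names and sizes are invented.

```python
import torch
import torch.nn.functional as F

vocab, hidden = 16, 32

# Stand-in "video model": embeds frame tokens and predicts the next token.
embed = torch.nn.Embedding(vocab, hidden)
gru = torch.nn.GRU(hidden, hidden, batch_first=True)
head = torch.nn.Linear(hidden, vocab)

def viper_rewards(frame_tokens):
    """Per-step reward = log-likelihood the model assigns to the next frame token."""
    h, _ = gru(embed(frame_tokens[:, :-1]))          # condition on the prefix
    logp = F.log_softmax(head(h), dim=-1)            # log p(. | x_<t)
    # Gather log p(x_t | x_<t) for the tokens actually observed.
    return logp.gather(-1, frame_tokens[:, 1:].unsqueeze(-1)).squeeze(-1)

tokens = torch.randint(0, vocab, (2, 10))  # batch of 2 token sequences
rewards = viper_rewards(tokens)            # one reward per predicted step
```

Trajectories that look like the expert videos the model was trained on receive high log-likelihood, and hence high reward, without any programmatic task reward.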
Barkour: Benchmarking Animal-level Agility with Quadruped Robots
Caluwaerts, Ken, Iscen, Atil, Kew, J. Chase, Yu, Wenhao, Zhang, Tingnan, Freeman, Daniel, Lee, Kuang-Huei, Lee, Lisa, Saliceti, Stefano, Zhuang, Vincent, Batchelor, Nathan, Bohez, Steven, Casarini, Federico, Chen, Jose Enrique, Cortes, Omar, Coumans, Erwin, Dostmohamed, Adil, Dulac-Arnold, Gabriel, Escontrela, Alejandro, Frey, Erik, Hafner, Roland, Jain, Deepali, Jyenis, Bauyrjan, Kuang, Yuheng, Lee, Edward, Luu, Linda, Nachum, Ofir, Oslund, Ken, Powell, Jason, Reyes, Diego, Romano, Francesco, Sadeghi, Fereshteh, Sloat, Ron, Tabanpour, Baruch, Zheng, Daniel, Neunert, Michael, Hadsell, Raia, Heess, Nicolas, Nori, Francesco, Seto, Jeff, Parada, Carolina, Sindhwani, Vikas, Vanhoucke, Vincent, Tan, Jie
Animals have evolved various agile locomotion strategies, such as sprinting, leaping, and jumping. There is growing interest in developing legged robots that move like their biological counterparts and exhibit various agile skills to navigate complex environments quickly. Despite this interest, the field lacks systematic benchmarks to measure the performance of control policies and hardware in agility. We introduce the Barkour benchmark, an obstacle course to quantify agility for legged robots. Inspired by dog agility competitions, it consists of diverse obstacles and a time-based scoring mechanism. This encourages researchers to develop controllers that not only move fast, but do so in a controllable and versatile way. To set strong baselines, we present two methods for tackling the benchmark. In the first approach, we train specialist locomotion skills using on-policy reinforcement learning methods and combine them with a high-level navigation controller. In the second approach, we distill the specialist skills into a Transformer-based generalist locomotion policy, named Locomotion-Transformer, that can handle various terrains and adjust the robot's gait based on the perceived environment.

There has been a proliferation of legged robot development inspired by animal mobility. An important research question in this field is how to develop a controller that enables legged robots to exhibit animal-level agility while also being able to generalize across various obstacles and terrains. Through the exploration of both learning and traditional control-based methods, there has been significant progress in enabling robots to walk across a wide range of terrains [10, 21, 20, 1, 27]. These robots are now capable of walking in a variety of indoor and outdoor environments, such as up and down stairs, through bushes, and over unpaved roads and rocky or even sandy beaches.

Despite advances in robot hardware and control, a major challenge in the field is the lack of standardized and intuitive methods for evaluating the effectiveness of locomotion controllers.