Reinforcement Learning
Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning
Voloshin, Cameron, Le, Hoang M., Jiang, Nan, Yue, Yisong
Off-policy policy evaluation (OPE) is the problem of estimating the online performance of a policy using only pre-collected historical data generated by another policy. Given the increasing interest in deploying learning-based methods for safety-critical applications, many recent OPE methods have recently been proposed. Due to disparate experimental conditions from recent literature, the relative performance of current OPE methods is not well understood. In this work, we present the first comprehensive empirical analysis of a broad suite of OPE methods. Based on thousands of experiments and detailed empirical analyses, we offer a summarized set of guidelines for effectively using OPE in practice, and suggest directions for future research.
Data-efficient Co-Adaptation of Morphology and Behaviour with Deep Reinforcement Learning
Luck, Kevin Sebastian, Amor, Heni Ben, Calandra, Roberto
Humans and animals are capable of quickly learning new behaviours to solve new tasks. Yet, we often forget that they also rely on a highly specialized morphology that co-adapted with motor control throughout thousands of years. Although compelling, the idea of co-adapting morphology and behaviours in robots is often unfeasible because of the long manufacturing times, and the need to re-design an appropriate controller for each morphology. In this paper, we propose a novel approach to automatically and efficiently co-adapt a robot morphology and its controller. Our approach is based on recent advances in deep reinforcement learning, and specifically the soft actor critic algorithm. Key to our approach is the possibility of leveraging previously tested morphologies and behaviors to estimate the performance of new candidate morphologies. As such, we can make full use of the information available for making more informed decisions, with the ultimate goal of achieving a more data-efficient co-adaptation (i.e., reducing the number of morphologies and behaviors tested). Simulated experiments show that our approach requires drastically less design prototypes to find good morphology-behaviour combinations, making this method particularly suitable for future co-adaptation of robot designs in the real world.
Reusable neural skill embeddings for vision-guided whole body movement and object manipulation
Merel, Josh, Tunyasuvunakool, Saran, Ahuja, Arun, Tassa, Yuval, Hasenclever, Leonard, Pham, Vu, Erez, Tom, Wayne, Greg, Heess, Nicolas
Both in simulation settings and robotics, there is an ambition to produce flexible control systems that can enable complex bodies to perform dynamic locomotion and natural object manipulation. In previous work, we developed a framework to train locomotor skills and reuse these skills for whole-body visuomotor tasks. Here, we extend this line of work to tasks involving whole body movement as well as visually guided manipulation of objects. This setting poses novel challenges in terms of task specification, exploration, and generalization. We develop an integrated approach consisting of a flexible motor primitive module, demonstrations, an instructed training regime as well as curricula in the form of task variations. We demonstrate the utility of our approach for solving challenging whole body tasks that require joint locomotion and manipulation, and characterize its behavioral robustness. We also provide a high-level overview video, see https://youtu.be/t0RDGSnE3cM .
Fully Parameterized Quantile Function for Distributional Reinforcement Learning
Yang, Derek, Zhao, Li, Lin, Zichuan, Qin, Tao, Bian, Jiang, Liu, Tieyan
Distributional Reinforcement Learning (RL) differs from traditional RL in that, rather than the expectation of total returns, it estimates distributions and has achieved state-of-the-art performance on Atari Games. The key challenge in practical distributional RL algorithms lies in how to parameterize estimated distributions so as to better approximate the true continuous distribution. Existing distributional RL algorithms parameterize either the probability side or the return value side of the distribution function, leaving the other side uniformly fixed as in C51, QR-DQN or randomly sampled as in IQN. In this paper, we propose fully parameterized quantile function that parameterizes both the quantile fraction axis (i.e., the x-axis) and the value axis (i.e., y-axis) for distributional RL. Our algorithm contains a fraction proposal network that generates a discrete set of quantile fractions and a quantile value network that gives corresponding quantile values. The two networks are jointly trained to find the best approximation of the true distribution. Experiments on 55 Atari Games show that our algorithm significantly outperforms existing distributional RL algorithms and creates a new record for the Atari Learning Environment for non-distributed agents.
Supplementary material for Uncorrected least-squares temporal difference with lambda-return
November 15, 2019 Abstract Here, we provide a supplementary material for Takayuki Osogami, "Uncorrected least-squares temporal difference with lambda-return," which appears in Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20) [Osogami, 2019]. A Proofs In this section, we prove Theorem 1, Lemma 1, Theorem 2, Lemma 2, and Proposition 1. Note that equations (1)-(19) refers to those in Osogami [2019]. A.1 Proof of Theorem 1 From (7)-(8), we have the following equality: A Unc T 1 T null t 0φ t null φ t (1 λ) γ T t null m 1( λγ) m 1 φ t mnull null (20) T 1 null t 0φ t null φ t (1 λ) γ T t null m 1(λγ) m 1 φ t mnull null φ T φ null T (21) T 1 null t 0φ tnull φ t (1 λ) γ T t 1 null m 1(λγ) m 1 φ t m (1 λ) γ (λγ) T t 1 φ Tnull null φ T φ null T (22) A Unc T T 1 null t 0φ t(1 λ) γ (λγ) T t 1 φ null T φ T φ null T (23) A Unc T null T null t 0(λγ) T t φ tnull φ null T γ null T 1 null t 0( λγ) T t 1 φ tnull φ null T (24) A Unc T ( z T γ z T 1) φ null T . The recursive computation of the eligibility trace can be verified in a straightforward manner.
Real-Time Reinforcement Learning
Ramstedt, Simon, Pal, Christopher
Markov Decision Processes (MDPs), the mathematical framework underlying most algorithms in Reinforcement Learning (RL), are often used in a way that wrongfully assumes that the state of an agent's environment does not change during action selection. As RL systems based on MDPs begin to find application in real-world safety critical situations, this mismatch between the assumptions underlying classical MDPs and the reality of real-time computation may lead to undesirable outcomes. In this paper, we introduce a new framework, in which states and actions evolve simultaneously and show how it is related to the classical MDP formulation. We analyze existing algorithms under the new real-time formulation and show why they are suboptimal when used in real-time. We then use those insights to create a new algorithm Real-Time Actor-Critic (RTAC) that outperforms the existing state-of-the-art continuous control algorithm Soft Actor-Critic both in real-time and non-real-time settings. Code and videos can be found at https://github.com/rmst/rtrl.
datamllab/rlcard
RLCard is a toolkit for Reinforcement Learning (RL) in card games. It supports multiple card environments with easy-to-use interfaces. The goal of RLCard is to bridge reinforcement learning and imperfect information games, and push forward the research of reinforcement learning in domains with multiple agents, large state and action space, and sparse reward. RLCard is developed by DATA Lab at Texas A&M University. We have just initialized a list of Awesome-Game-AI resources.
New robotic arm at University of Alberta to help students better understand artificial intelligence
Students at the University of Alberta are getting hands-on experience with artificial intelligence with a new robotic arm. Donated to the university's department of computing science by Kindred AI, a Canadian-based artificial intelligence company, the use of the robotic arm in the classroom helps students get a sense of reinforcement learning. Reinforcement learning is a branch of artificial intelligence, says Rapum Mahmood, assistant professor at the U of A and former Kindred AI research lead. "In reinforcement learning, we study by letting the agent interact with the environment, so that it can take the right set of actions," said Mahmood. Usually, the study is done through computer simulations and board games but in real-world applications, a robotic arm is used.
A Reduction from Reinforcement Learning to No-Regret Online Learning
Cheng, Ching-An, Combes, Remi Tachet des, Boots, Byron, Gordon, Geoff
We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which "any" online algorithm with sublinear regret can generate policies with provable performance guarantees. This new perspective decouples the RL problem into two parts: regret minimization and function approximation. The first part admits a standard online-learning analysis, and the second part can be quantified independently of the learning algorithm. Therefore, the proposed reduction can be used as a tool to systematically design new RL algorithms. We demonstrate this idea by devising a simple RL algorithm based on mirror descent and the generative-model oracle. For any $\gamma$-discounted tabular RL problem, with probability at least $1-\delta$, it learns an $\epsilon$-optimal policy using at most $\tilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\log(\frac{1}{\delta})}{(1-\gamma)^4\epsilon^2}\right)$ samples. Furthermore, this algorithm admits a direct extension to linearly parameterized function approximators for large-scale applications, with computation and sample complexities independent of $|\mathcal{S}|$,$|\mathcal{A}|$, though at the cost of potential approximation bias.
Accelerating Training in Pommerman with Imitation and Reinforcement Learning
Meisheri, Hardik, Shelke, Omkar, Verma, Richa, Khadilkar, Harshad
The Pommerman simulation was recently developed to mimic the classic Japanese game Bomberman, and focuses on competitive gameplay in a multi-agent setting. We focus on the 2$\times$2 team version of Pommerman, developed for a competition at NeurIPS 2018. Our methodology involves training an agent initially through imitation learning on a noisy expert policy, followed by a proximal-policy optimization (PPO) reinforcement learning algorithm. The basic PPO approach is modified for stable transition from the imitation learning phase through reward shaping, action filters based on heuristics, and curriculum learning. The proposed methodology is able to beat heuristic and pure reinforcement learning baselines with a combined 100,000 training games, significantly faster than other non-tree-search methods in literature. We present results against multiple agents provided by the developers of the simulation, including some that we have enhanced. We include a sensitivity analysis over different parameters, and highlight undesirable effects of some strategies that initially appear promising. Since Pommerman is a complex multi-agent competitive environment, the strategies developed here provide insights into several real-world problems with characteristics such as partial observability, decentralized execution (without communication), and very sparse and delayed rewards.