Reinforcement Learning
Rainbow: Combining Improvements in Deep Reinforcement Learning
Hessel, Matteo (DeepMind) | Modayil, Joseph (DeepMind) | Hasselt, Hado van (DeepMind) | Schaul, Tom (DeepMind) | Ostrovski, Georg (DeepMind) | Dabney, Will (DeepMind) | Horgan, Dan (DeepMind) | Piot, Bilal (DeepMind) | Azar, Mohammad (DeepMind) | Silver, David (DeepMind)
The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance. We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.
Deep Reinforcement Learning That Matters
Henderson, Peter (McGill University) | Islam, Riashat (McGill University) | Bachman, Philip (Microsoft) | Pineau, Joelle (McGill University) | Precup, Doina (McGill University) | Meger, David (McGill University)
In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
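One way to make the call for significance metrics concrete is a percentile-bootstrap confidence interval over per-seed results. The sketch below is only an illustration: the function name `bootstrap_mean_ci`, the resample count, and the return values are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def bootstrap_mean_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(samples, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical final returns from five random seeds per method (made-up numbers).
baseline = [1020.0, 1480.0, 890.0, 1710.0, 1250.0]
proposed = [1390.0, 1150.0, 1820.0, 1600.0, 1010.0]

if __name__ == "__main__":
    print("baseline 95% CI:", bootstrap_mean_ci(baseline))
    print("proposed 95% CI:", bootstrap_mean_ci(proposed))
```

If the two intervals overlap heavily, as is likely with so few seeds, a difference in mean return alone is weak evidence of a real improvement.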
OptionGAN: Learning Joint Reward-Policy Options Using Generative Adversarial Inverse Reinforcement Learning
Henderson, Peter (McGill University) | Chang, Wei-Di (McGill University) | Bacon, Pierre-Luc (McGill University) | Meger, David (McGill University) | Pineau, Joelle (McGill University) | Precup, Doina (McGill University)
Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward function can be difficult, especially for intricate tasks. Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories arising from a diverse set of underlying reward functions rather than a single one. Thus, in inverse reinforcement learning, it is useful to consider such a decomposition. The options framework in reinforcement learning is specifically designed to decompose policies in a similar light. We therefore extend the options framework and propose a method to simultaneously recover reward options in addition to policy options. We leverage adversarial methods to learn joint reward-policy options using only observed expert states. We show that this approach works well in both simple and complex continuous control tasks and shows significant performance increases in one-shot transfer learning.
Reinforced Multi-Label Image Classification by Exploring Curriculum
He, Shiyi (Peking University) | Xu, Chang (UBTECH Sydney AI Centre, SIT, FEIT, University of Sydney) | Guo, Tianyu (Peking University) | Xu, Chao (Peking University) | Tao, Dacheng (UBTECH Sydney AI Centre, SIT, FEIT, University of Sydney)
Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Inspired by this curriculum learning mechanism, we propose a reinforced multi-label image classification approach that imitates human behavior by labeling images from easy to complex. This approach allows a reinforcement learning agent to sequentially predict labels by fully exploiting image features and previously predicted labels. The agent discovers the optimal policies by maximizing the long-term reward, which reflects prediction accuracy. Experimental results on PASCAL VOC2007 and 2012 demonstrate the necessity of reinforcement multi-label learning and the algorithm’s effectiveness in real-world multi-label image classification tasks.
Learning With Options That Terminate Off-Policy
Harutyunyan, Anna (Vrije Universiteit Brussel) | Vrancx, Peter (PROWLER.io) | Bacon, Pierre-Luc (McGill University) | Precup, Doina (McGill University) | Nowé, Ann (Vrije Universiteit Brussel)
A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides the option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy well, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just as is done with policies in off-policy learning. To this end, we give a new algorithm, Q(beta), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(beta) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.
When Waiting Is Not an Option: Learning Options With a Deliberation Cost
Harb, Jean (McGill University) | Bacon, Pierre-Luc (McGill University) | Klissarov, Martin (McGill University) | Precup, Doina (McGill University)
This perspective helps us to formulate more precisely what objective criteria should be fulfilled during option construction. We propose that good options are those which allow an agent to learn and plan faster, and provide an optimization objective for learning options based on this idea. We implement the optimization using the option-critic framework (Bacon et al. 2017) and illustrate its usefulness with experiments in Atari games.
Temporal abstraction has a rich history in AI (Minsky 1961; Fikes et al. 1972; Kuipers 1979; Korf 1983; Iba 1989; Drescher 1991; Dayan and Hinton 1992; Kaelbling 1993; Thrun and Schwartz 1995; Parr and Russell 1998; Dietterich 1998) and has been presented as a useful mechanism for a variety of problems that affect AI systems in many settings, including to: generate shorter plans, speed up planning, improve generalization, yield better exploration, increase
Multi-Step Reinforcement Learning: A Unifying Algorithm
Asis, Kristopher De (University of Alberta) | Hernandez-Garcia, J. Fernando (University of Alberta) | Holland, G. Zacharias (University of Alberta) | Sutton, Richard S. (University of Alberta)
Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD(λ) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter. Currently, there are a multitude of algorithms that can be used to perform TD control, including Sarsa, Q-learning, and Expected Sarsa. These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance. Each of these algorithms is seemingly distinct, and no one dominates the others for all problems. In this paper, we study a new multi-step action-value algorithm called Q(σ) that unifies and generalizes these existing algorithms, while subsuming them as special cases. A new parameter, σ, is introduced to allow the degree of sampling performed by the algorithm at each step during its backup to be continuously varied, with Sarsa existing at one extreme (full sampling), and Expected Sarsa existing at the other (pure expectation). Q(σ) is generally applicable to both on- and off-policy learning, but in this work we focus on experiments in the on-policy case. Our results show that an intermediate value of σ, which results in a mixture of the existing algorithms, performs better than either extreme. The mixture can also be varied dynamically, which can result in even greater performance.
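The role of σ can be illustrated with a one-step backup target that blends the sampled Sarsa target with the Expected Sarsa target. This is a minimal sketch of the one-step case only, not the multi-step algorithm studied in the paper; the function name `q_sigma_target` and its argument layout are assumptions made for illustration.

```python
def q_sigma_target(reward, gamma, sigma, q_next, pi_next, next_action):
    """One-step target mixing sampling (sigma = 1) and expectation (sigma = 0).

    q_next:      action values Q(s', a) for every action a in the next state
    pi_next:     target-policy probabilities pi(a | s') for every action a
    next_action: index of the action actually sampled in the next state
    """
    sampled = q_next[next_action]                            # Sarsa-style term
    expected = sum(p * q for p, q in zip(pi_next, q_next))   # Expected Sarsa term
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)


if __name__ == "__main__":
    q_next, pi_next = [1.0, 3.0], [0.5, 0.5]
    for sigma in (0.0, 0.5, 1.0):
        print(sigma, q_sigma_target(1.0, 0.5, sigma, q_next, pi_next, next_action=1))
```

Setting `sigma` to 1 recovers the Sarsa target and setting it to 0 recovers the Expected Sarsa target; intermediate values interpolate linearly between the two.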
Distributional Reinforcement Learning With Quantile Regression
Dabney, Will (DeepMind) | Rowland, Mark (University of Cambridge) | Bellemare, Marc G. (Google Brain) | Munos, Rémi (DeepMind)
In reinforcement learning (RL), an agent interacts with the environment by taking actions and observing the next state and reward. When sampled probabilistically, these state transitions, rewards, and actions can all induce randomness in the observed long-term return. Traditionally, reinforcement learning algorithms average over this randomness to estimate the value function. In this paper, we build on recent work advocating a distributional approach to reinforcement learning in which the distribution over returns is modeled explicitly instead of only estimating the mean. That is, we examine methods of learning the value distribution instead of the value function. We give results that close a number of gaps between the theoretical and algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we extend existing results to the approximate distribution setting. Second, we present a novel distributional reinforcement learning algorithm consistent with our theoretical formulation. Finally, we evaluate this new algorithm on the Atari 2600 games, observing that it significantly outperforms many of the recent improvements on DQN, including the related distributional algorithm C51.
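The core idea of quantile regression can be sketched with the plain quantile (pinball) loss, whose minimizer over a set of return samples is the τ-quantile of those samples. The sketch below makes two assumptions that go beyond the abstract: the helper names are hypothetical, and the target fractions are taken as the midpoints (2i − 1)/(2N) of N equal-probability bins. It uses the plain loss rather than any smoothed variant.

```python
def pinball_loss(theta, samples, tau):
    """Average quantile-regression loss of the estimate `theta` at fraction `tau`."""
    total = 0.0
    for z in samples:
        u = z - theta
        # Under-estimates (u >= 0) are weighted by tau, over-estimates by 1 - tau.
        total += u * (tau - (1.0 if u < 0 else 0.0))
    return total / len(samples)


def quantile_midpoints(n):
    """Midpoints of n equal-probability bins: (2i - 1) / (2n) for i = 1..n."""
    return [(2 * i + 1) / (2 * n) for i in range(n)]


if __name__ == "__main__":
    returns = [float(x) for x in range(11)]  # hypothetical sampled returns 0..10
    for tau in quantile_midpoints(4):
        best = min(returns, key=lambda theta: pinball_loss(theta, returns, tau))
        print(f"tau={tau:.3f} -> minimizing estimate {best}")
```

Learning one such estimate per fraction yields a discrete approximation of the return distribution instead of a single mean value.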
Expected Policy Gradients
Ciosek, Kamil (University of Oxford) | Whiteson, Shimon (University of Oxford)
We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by Expected Sarsa, EPG integrates across the action when estimating the gradient, instead of relying only on the action in the sampled trajectory. We establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. We also prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead. Finally, we show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to the exponential of the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new form of exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic in four challenging MuJoCo domains.
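The difference between using only the sampled action and integrating across actions can be shown for a single state with a small discrete softmax policy. This is an illustrative sketch under assumptions: the paper's main results concern continuous (Gaussian) policies, whereas this example uses a discrete action set where the integral is a finite sum, and all function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sampled_gradient(logits, q_values, action):
    """Score-function estimate from one sampled action: grad log pi(a|s) * Q(s, a)."""
    pi = softmax(logits)
    return [((1.0 if k == action else 0.0) - pi[k]) * q_values[action]
            for k in range(len(logits))]


def expected_gradient(logits, q_values):
    """Same gradient with the action summed out: sum_a pi(a|s) grad log pi(a|s) Q(s, a)."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p in enumerate(pi):
        g = sampled_gradient(logits, q_values, a)
        for k in range(len(grad)):
            grad[k] += p * g[k]
    return grad


if __name__ == "__main__":
    logits, q_values = [0.2, -0.1, 0.5], [1.0, 2.0, 0.5]
    print("expected:", expected_gradient(logits, q_values))
    print("sampled (a=1):", sampled_gradient(logits, q_values, 1))
```

The sampled estimate changes with whichever action happened to be drawn, while the expected version is a deterministic function of the state, which is the source of the variance reduction described above.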
Gated-Attention Architectures for Task-Oriented Language Grounding
Chaplot, Devendra Singh (Carnegie Mellon University) | Sathyendra, Kanthashree Mysore (Carnegie Mellon University, Language Technologies Institute) | Pasumarthi, Rama Kumar (Carnegie Mellon University, Language Technologies Institute) | Rajagopal, Dheeraj (Carnegie Mellon University, Language Technologies Institute) | Salakhutdinov, Ruslan (Carnegie Mellon University)
To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map them to visual elements and actions in the environment. This problem is called task-oriented language grounding. We propose an end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input. The proposed model combines the image and text representations using a Gated-Attention mechanism and learns a policy to execute the natural language instruction using standard reinforcement and imitation learning methods. We show the effectiveness of the proposed model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. We also introduce a novel environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states.
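The abstract does not spell out how the image and text representations are combined, so the following is only one plausible reading of a gated-attention step: the instruction embedding is mapped to one sigmoid gate per image feature map, and each map is scaled by its gate. The function name, the tensor shapes, and the linear layer's `weights` and `bias` are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention(feature_maps, text_embedding, weights, bias):
    """Scale each image feature map by a gate computed from the text embedding.

    feature_maps:   nested list of shape [C][H][W] (hypothetical conv output)
    text_embedding: list of length D (hypothetical instruction encoding)
    weights, bias:  linear layer of shape [C][D] and [C] producing one gate per map
    """
    gates = [
        sigmoid(sum(w * x for w, x in zip(row, text_embedding)) + b)
        for row, b in zip(weights, bias)
    ]
    # Broadcast each scalar gate over the H x W positions of its feature map.
    return [[[g * v for v in line] for line in fmap]
            for g, fmap in zip(gates, feature_maps)]


if __name__ == "__main__":
    maps = [[[2.0, 4.0]], [[6.0, 8.0]]]        # C=2, H=1, W=2
    text = [1.0, -1.0]                          # D=2
    weights = [[4.0, 0.0], [0.0, 4.0]]          # map 0 is favored by this instruction
    print(gated_attention(maps, text, weights, bias=[0.0, 0.0]))
```

With this kind of gating the instruction decides which visual channels pass through to the policy, so the same image can yield different downstream features for different instructions.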