Reinforcement Learning
Successor Features Support Model-based and Model-free Reinforcement Learning
Lehnert, Lucas, Littman, Michael L.
One key challenge in reinforcement learning is the ability to generalize knowledge in control problems. While deep learning methods have been successfully combined with model-free reinforcement-learning algorithms, how to perform model-based reinforcement learning in the presence of approximation errors still remains an open problem. Using successor features, a feature representation that predicts a temporal constraint, this paper presents three contributions: First, it shows how learning successor features is equivalent to model-free learning. Then, it shows how successor features encode model reductions that compress the state space by creating state partitions of bisimilar states. Using this representation, an intelligent agent is guaranteed to accurately predict future reward outcomes, a key property of model-based reinforcement-learning algorithms. Lastly, it presents a loss objective and prediction error bounds showing that accurately predicting value functions and reward sequences is possible with an approximation of successor features. On finite control problems, we illustrate how minimizing this loss objective results in approximate bisimulations. The results presented in this paper provide a novel understanding of representations that can support model-free and model-based reinforcement learning.
Learning Action Representations for Reinforcement Learning
Chandak, Yash, Theocharous, Georgios, Kostas, James, Jordan, Scott, Thomas, Philip S.
Most model-free reinforcement learning methods leverage state representations (embeddings) for generalization, but either ignore structure in the space of actions or assume the structure is provided a priori. We show how a policy can be decomposed into a component that acts in a low-dimensional space of action representations and a component that transforms these representations into actual actions. These representations improve generalization over large, finite action sets by allowing the agent to infer the outcomes of actions similar to actions already taken. We provide an algorithm to both learn and use action representations and provide conditions for its convergence. The efficacy of the proposed method is demonstrated on large-scale real-world problems.
Privacy Preserving Off-Policy Evaluation
Xie, Tengyang, Thomas, Philip S., Miklau, Gerome
Many proposed applications of reinforcement learning (RL) involve the use of data that could contain sensitive information. For example, Raghu et al. [2017] proposed an application of RL and off-policy evaluation methods that uses peoples' medical records, and Theocharous et al. [2015] applied off-policy evaluation methods to user data collected by a bank in order to improve the targeting of advertisements. In examples like these, the data used by the RL systems is sensitive, and one should ensure that the methods applied to the data do not leak any sensitive information. Recently, Balle et al. [2016] showed how techniques from differential privacy can be used to ensure that (with high probability) policy evaluation methods for RL do not leak (much) sensitive information. In this paper we extend their work in two ways.
A Geometric Perspective on Optimal Representations for Reinforcement Learning
Bellemare, Marc G., Dabney, Will, Dadashi, Robert, Taiga, Adrien Ali, Castro, Pablo Samuel, Roux, Nicolas Le, Schuurmans, Dale, Lattimore, Tor, Lyle, Clare
This paper proposes a new approach to representation learning based on geometric properties of the space of value functions. We study a two-part approximation of the value function: a nonlinear map from states to vectors, or representation, followed by a linear map from vectors to values. Our formulation considers adapting the representation to minimize the (linear) approximation of the value function of all stationary policies for a given environment. We show that this optimization reduces to making accurate predictions regarding a special class of value functions which we call adversarial value functions (AVFs). We argue that these AVFs make excellent auxiliary tasks, and use them to construct a loss which can be efficiently minimized to find a near-optimal representation for reinforcement learning. We highlight characteristics of the method in a series of experiments on the four-room domain.
Contrasting Exploration in Parameter and Action Space: A Zeroth-Order Optimization Perspective
Vemula, Anirudh, Sun, Wen, Bagnell, J. Andrew
Black-box optimizers that explore in parameter space have often been shown to outperform more sophisticated action space exploration methods developed specifically for the reinforcement learning problem. We examine these black-box methods closely to identify situations in which they are worse than action space exploration methods and those in which they are superior. Through simple theoretical analyses, we prove that complexity of exploration in parameter space depends on the dimensionality of parameter space, while complexity of exploration in action space depends on both the dimensionality of action space and horizon length. This is also demonstrated empirically by comparing simple exploration methods on several model problems, including Contextual Bandit, Linear Regression and Reinforcement Learning in continuous control.
An Optimization Framework for Task Sequencing in Curriculum Learning
Foglino, Francesco, Leonetti, Matteo
Abstract--Curriculum learning is gaining popularity in (deep) reinforcement learning. It can alleviate the burden on data collection and provide better exploration policies through transfer and generalization from less complex tasks. Current methods for automatic task sequencing for curriculum learning in reinforcement learning provided initial heuristic solutions, with little to no guarantee on their quality. We introduce an optimization framework for task sequencing composed of the problem definition, several candidate performance metrics for optimization, and three benchmark algorithms. We experimentally show that the two most commonly used baselines (learning with no curriculum, and with a random curriculum) perform worse than a simple greedy algorithm. Furthermore, we show theoretically and demonstrate experimentally that the three proposed algorithms provide increasing solution quality at moderately increasing computational complexity, and show that they constitute better baselines for curriculum learning in reinforcement learning. Reinforcement Learning (RL) has recently been successfully applied to a number of tasks whose complexity would have appeared overwhelming only a few years ago [1], [2]. In such large and complex environments, classical exploration strategies designed for Markov Decision Processes (MDPs), aiming at visiting every state the most efficiently, are inadequate. One approach actively investigated is the use of transfer learning [3] to generalize from previous similar tasks, and more recently the application of transfer learning to sequences of tasks of increasing complexity forming a curriculum . Curriculum Learning is often employed in (Deep) RL [4], [5] to let the agent progress more quickly towards better behaviors, but curricula are mostly designed by hand. Curriculum learning has the potential to greatly increase the quality of the behavior discovered by the agent. However, at the moment, creating an appropriate curriculum requires significant human intuition.
A Theory of Regularized Markov Decision Processes
Geist, Matthieu, Scherrer, Bruno, Pietquin, Olivier
Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or on Kullback-Leibler divergence. We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. The core building blocks of this theory are a notion of regularized Bellman operator and the Legendre-Fenchel transform, a classical tool of convex optimization. This approach allows for error propagation analyses of general algorithmic schemes of which (possibly variants of) classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming are special cases. This also draws connections to proximal convex optimization, especially to Mirror Descent.
CLIC: Curriculum Learning and Imitation for feature Control in non-rewarding environments
Fournier, Pierre, Sigaud, Olivier, Colas, Cédric, Chetouani, Mohamed
In this paper, we propose an unsupervised reinforcement learning agent called CLIC for Curriculum Learning and Imitation for Control. This agent learns to control features in its environment without external rewards, and observes the actions of a third party agent, Bob, who does not necessarily provide explicit guidance. CLIC selects which feature to train on and what to imitate from Bob's behavior by maximizing its learning progress. We show that CLIC can effectively identify helpful behaviors in Bob's actions, and imitate them to control the environment faster. CLIC can also follow Bob when he acts as a mentor and provides ordered demonstrations. Finally, when Bob controls features than the agent cannot, or in presence of a hierarchy between aspects of the environment, we show that CLIC ignores non-reproducible and already mastered behaviors, resulting in a greater benefit from imitation.
Learning Independently-Obtainable Reward Functions
Grimm, Christopher, Singh, Satinder
We present a novel method for learning a set of disentangled reward functions that sum to the original environment reward and are constrained to be independently obtainable. We define independent obtainability in terms of value functions with respect to obtaining one learned reward while pursuing another learned reward. Empirically, we illustrate that our method can learn meaningful reward decompositions in a variety of domains and that these decompositions exhibit some form of generalization performance when the environment's reward is modified. Theoretically, we derive results about the effect of maximizing our method's objective on the resulting reward functions and their corresponding optimal policies.