Reinforcement Learning
Using Bisimulation for Policy Transfer in MDPs
Castro, Pablo Samuel (McGill University) | Precup, Doina (McGill University)
Knowledge transfer has been suggested as a useful approach for solving large Markov Decision Processes. The main idea is to compute a decision-making policy in one environment and use it in a different environment, provided the two are “close enough”. In this paper, we use bisimulation-style metrics (Ferns et al., 2004) to guide knowledge transfer. We propose algorithms that decide what actions to transfer from the policy computed on a small MDP task to a large task, given the bisimulation distance between states in the two tasks. We demonstrate the inherent “pessimism” of bisimulation metrics and present variants of this metric aimed to overcome this pessimism, leading to improved action transfer. We also show that using this approach for transferring temporally extended actions (Sutton et al., 1999) is more successful than using it exclusively with primitive actions. We present theoretical guarantees on the quality of the transferred policy, as well as promising empirical results.
Integrating Sample-Based Planning and Model-Based Reinforcement Learning
Walsh, Thomas J. (Rutgers University) | Goschin, Sergiu (Rutgers University) | Littman, Michael L. (Rutgers University)
Recent advancements in model-based reinforcement learning have shown that the dynamics of many structured domains (e.g. DBNs) can be learned with tractable sample complexity, despite their exponentially large state spaces. Unfortunately, these algorithms all require access to a planner that computes a near optimal policy, and while many traditional MDP algorithms make this guarantee, their computation time grows with the number of states. We show how to replace these over-matched planners with a class of sample-based planners — whose computation time is independent of the number of states — without sacrificing the sample-efficiency guarantees of the overall learning algorithms. To do so, we define sufficient criteria for a sample-based planner to be used in such a learning system and analyze two popular sample-based approaches from the literature. We also introduce our own sample-based planner, which combines the strategies from these algorithms and still meets the criteria for integration into our learning system. In doing so, we define the first complete RL solution for compactly represented (exponentially sized) state spaces with efficiently learnable dynamics that is both sample efficient and whose computation time does not grow rapidly with the number of states.
Reinforcement Learning via AIXI Approximation
Veness, Joel (University of New South Wales and NICTA) | Ng, Kee Siong (Medicare Australia and Australian National University) | Hutter, Marcus (Australian National University and NICTA) | Silver, David (University College London)
This paper introduces a principled approach for the design of a scalable general reinforcement learning agent. This approach is based on a direct approximation of AIXI, a Bayesian optimality notion for general reinforcement learning agents. Previously, it has been unclear whether the theory of AIXI could motivate the design of practical algorithms. We answer this hitherto open question in the affirmative, by providing the first computationally feasible approximation to the AIXI agent. To develop our approximation, we introduce a Monte Carlo Tree Search algorithm along with an agent-specific extension of the Context Tree Weighting algorithm. Empirically, we present a set of encouraging results on a number of stochastic, unknown, and partially observable domains.
Reinforcement Learning Via Practice and Critique Advice
Judah, Kshitij (Oregon State University) | Roy, Saikat (Oregon State University) | Fern, Alan (Oregon State University) | Dietterich, Thomas G. (Oregon State University)
We consider the problem of incorporating end-user advice into reinforcement learning (RL). In our setting, the learner alternates between practicing, where learning is based on actual world experience, and end-user critique sessions where advice is gathered. During each critique session the end-user is allowed to analyze a trajectory of the current policy and then label an arbitrary subset of the available actions as good or bad. Our main contribution is an approach for integrating all of the information gathered during practice and critiques in order to effectively optimize a parametric policy. The approach optimizes a loss function that linearly combines losses measured against the world experience and the critique data. We evaluate our approach using a prototype system for teaching tactical battle behavior in a real-time strategy game engine. Results are given for a significant evaluation involving ten end-users showing the promise of this approach and also highlighting challenges involved in inserting end-users into the RL loop.
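The linear combination of losses described in this abstract can be sketched as follows. This is only an illustration: the function name `combined_loss`, the weight `lam`, and the choice of a convex combination are assumptions, not details taken from the paper.

```python
def combined_loss(lam: float, practice_loss: float, critique_loss: float) -> float:
    """Linearly combine two scalar losses.

    lam is a hypothetical trade-off weight in [0, 1]: lam = 1 trusts only
    the loss measured on world experience (practice), lam = 0 trusts only
    the loss measured on the end-user critique data.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * practice_loss + (1.0 - lam) * critique_loss


# Example: weight the critique data three times as heavily as practice.
print(combined_loss(0.25, 2.0, 4.0))  # 3.5
```

A parametric policy would then be optimized against this single scalar, so the weight controls how much the advice can override what was observed during practice.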
Local Optimization for Simulation of Natural Motion
Erez, Tom (Washington University in St. Louis)
I intend to use RL to bring the two together, and generate motion from the proposed first principles in realistic biomechanical models, and compare the results to the behavior of living creatures. This is a nontrivial problem: biomechanical models are continuous, high-dimensional and nonlinear, and the optimality criteria considered in the literature are non-quadratic. In order to address these profound challenges, I propose three basic principles
The Reinforcement Learning (RL) agent interacts with a dynamical system whose states capture all the relevant information about the current configuration of the agent and its environment. By specifying a sequence of actions, the agent alters the state transitions of this dynamical system. The optimality criterion is formalized by a reward function defined over state-action pairs, and the agent's goal is to maximize the cumulative reward.
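The interaction described here (a sequence of actions driving a dynamical system, scored by a reward over state-action pairs) can be sketched as a simple rollout. The toy dynamics and reward in the example are hypothetical stand-ins, not the biomechanical models the abstract refers to.

```python
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")
A = TypeVar("A")


def rollout_return(
    step: Callable[[S, A], S],
    reward: Callable[[S, A], float],
    state: S,
    actions: Iterable[A],
) -> float:
    """Apply a sequence of actions and sum the reward of each state-action pair."""
    total = 0.0
    for action in actions:
        total += reward(state, action)  # reward is defined over (state, action)
        state = step(state, action)     # the action alters the state transition
    return total


# Hypothetical 1-D system: the action shifts the state; being far from the
# origin and using large actions are both penalized.
toy_step = lambda s, a: s + a
toy_reward = lambda s, a: -abs(s) - abs(a)
print(rollout_return(toy_step, toy_reward, 3, [-1, -1, -1]))  # -9.0
```

An optimizer would search over the action sequence to maximize this return; the sketch only evaluates one fixed sequence.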
Automatic Methods for Continuous State Space Abstraction
Loscalzo, Steven (Air Force Research Laboratory Information Directorate) | Wright, Robert (Air Force Research Laboratory Information Directorate)
Reinforcement learning algorithms are often tasked with learning an optimal control policy in a continuous state space. Since it is infeasible to learn the optimal action to take for every possible observation in a continuous state space, useful abstractions of the space must be constructed and subsequently learned on. Abstraction techniques that generalize the space into very few abstract states must take care to avoid creating an abstraction that prevents learning the optimal policy. Many commonly used abstractions, such as CMAC, can take considerable effort to tune to ensure a learnable abstraction is created. In this work we propose three methods that derive state abstractions automatically, in part by making use of the dimensionality reduction capability of the RL-SANE algorithm. We show that abstractions derived from these automatic methods can allow a learning algorithm to converge to the optimal policy faster than with a fixed abstraction. Additionally, these techniques are able to break the space into very few abstract states, further facilitating rapid learning.
Evolutionary Tile Coding: An Automated State Abstraction Algorithm for Reinforcement Learning
Lin, Stephen (Air Force Research Laboratory, Information Directorate) | Wright, Robert (Air Force Research Laboratory, Information Directorate)
Reinforcement learning (RL) algorithms have the ability to learn optimal policies for control problems by exploring a domain's state space. Unfortunately, for most problems the size of the state space is too great for RL technologies to fully explore in order to find good policies. State abstraction is one way of reducing the size and complexity of a domain's state space in order to enable RL. In this paper we introduce a new approach for automatically deriving state abstractions called Evolutionary Tile Coding that uses a genetic algorithm for deriving effective tile codings. We provide an empirical analysis of the new algorithm comparing it to another adaptive tile coding method as well as fixed tile coding. Our results show that our approach is able to automatically derive effective state abstractions for two RL benchmark problems. Additionally, we present an intriguing result that shows the classical mountain car problem's state space can be reduced to just two states and still preserve the discovery of an optimal policy.
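For readers unfamiliar with the fixed tile coding baseline mentioned in this abstract, a minimal one-dimensional sketch follows. It is plain, uniformly offset tile coding, not the evolutionary method the paper proposes; the function name, the offset scheme, and the indexing layout are assumptions made for illustration.

```python
def active_tiles(x: float, low: float, high: float,
                 tiles_per_tiling: int, num_tilings: int) -> list[int]:
    """Return one active tile index per tiling for a scalar input x.

    Each tiling partitions [low, high] into tiles of equal width and is
    shifted by a fraction of a tile relative to the previous tiling. Every
    tiling gets one extra tile so the shifted grid still covers the range.
    """
    width = (high - low) / tiles_per_tiling
    tiles_in_one_tiling = tiles_per_tiling + 1
    indices = []
    for t in range(num_tilings):
        offset = t * width / num_tilings           # assumed uniform offsets
        idx = int((x - low + offset) // width)
        idx = min(max(idx, 0), tiles_per_tiling)   # clamp out-of-range inputs
        indices.append(t * tiles_in_one_tiling + idx)
    return indices


# Two tilings of four tiles each over [0, 1].
print(active_tiles(0.5, 0.0, 1.0, 4, 2))  # [2, 7]
```

The abstract state of an input is the set of active tiles, so nearby inputs share most of their tiles and a learner generalizes between them. An adaptive method would choose the tile boundaries instead of fixing them in advance.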
Algorithms for Reinforcement Learning
Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations.