Reinforcement Learning
Policy Shaping: Integrating Human Feedback with Reinforcement Learning
Griffith, Shane, Subramanian, Kaushik, Scholz, Jonathan, Isbell, Charles L., Thomaz, Andrea L.
A long term goal of Interactive Reinforcement Learning is to incorporate non-expert human feedback to solve complex tasks. State-of-the-art methods have approached this problem by mapping human information to reward and value signals to indicate preferences and then iterating over them to compute the necessary control policy. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct labels on the policy. We compare Advise to state-of-the-art approaches and highlight scenarios where it outperforms them and importantly is robust to infrequent and inconsistent human feedback.
Reward Mapping for Transfer in Long-Lived Agents
Guo, Xiaoxiao, Singh, Satinder, Lewis, Richard L.
We consider how to transfer knowledge from previous tasks to a current task in long-lived and bounded agents that must solve a sequence of MDPs over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent's behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent's performance relative to other approaches, including an approach that transfers policies.
Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
Gabillon, Victor, Ghavamzadeh, Mohammad, Scherrer, Bruno
Tetris is a popular video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A close look at the literature of this game shows that while ADP algorithms, that have been (almost) entirely based on approximating the value function (value function based), have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus, learn than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our extensive experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small $10\times 10$ and large $10\times 20$ boards. Although the CBMPI's results are similar to those achieved by the CE method in the large board, CBMPI uses considerably fewer (almost 1/10) samples (call to the generative model of the game) than CE.
Actor-Critic Algorithms for Risk-Sensitive MDPs
L.A., Prashanth, Ghavamzadeh, Mohammad
In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.
(More) Efficient Reinforcement Learning via Posterior Sampling
Osband, Ian, Russo, Daniel, Van Roy, Benjamin
Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the episode length and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.
Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
Guez, Arthur, Silver, David, Dayan, Peter
Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems -- because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration.
Scalable and Efficient Bayes-Adaptive Reinforcement Learning Based on Monte-Carlo Tree Search
Guez, A., Silver, D., Dayan, P.
Bayesian planning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, planning optimally in the face of uncertainty is notoriously taxing, since the search space is enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach avoids expensive applications of Bayes rule within the search tree by sampling models from current beliefs, and furthermore performs this sampling in a lazy manner. This enables it to outperform previous Bayesian model-based reinforcement learning algorithms by a significant margin on several well-known benchmark problems. As we show, our approach can even work in problems with an infinite state space that lie qualitatively out of reach of almost all previous work in Bayesian exploration.
How to Abstract Intelligence? (If Verification Is in Order)
Pathak, Shashank (Istituto Italiano di Tecnologia) | Pulina, Luca (Università degli Studi di Sassari) | Metta, Giorgio (Istituto Italiano di Tecnologia) | Tacchella, Armando (Università degli Studi di Genova)
In this paper, we focus on learning intelligent agents through model-free reinforcement learning. Rather than arguing that reinforcement learning is the right abstraction for attaining intelligent behavior, we consider the issue of finding useful abstractions to represent the agent and the environment when verification is in order. Indeed, verifying that the agent’s behavior complies to some stated safety property — an ”Asimovian” perspective — only adds to the challenge that abstracting intelligence represents per se. In the paper, we show an example application about verification of abstractions in model-free learning, and we argue about potential (more) useful abstractions in the same context.
A Modular Reinforcement Learning Framework for Interactive Narrative Planning
Rowe, Jonathan P. (North Carolina State University) | Lester, James C. (North Carolina State University)
A key functionality provided by interactive narrative systems is narrative adaptation: tailoring story experiences in response to users’ actions and needs. We present a data-driven framework for dynamically tailoring events in interactive narratives using modular reinforcement learning. The framework involves decomposing an interactive narrative into multiple concurrent sub-problems, formalized as adaptable event sequences (AESs). Each AES is modeled as an independent Markov decision process (MDP). Policies for each MDP are induced using a corpus of user interaction data from an interactive narrative system with exploratory narrative adaptation policies. Rewards are computed based on users’ experiential outcomes. Conflicts between multiple policies are handled using arbitration procedures. In addition to introducing the framework, we describe a corpus of user interaction data from a testbed interactive narrative, CRYSTAL ISLAND, for inducing narrative adaptation policies. Empirical findings suggest that the framework can effectively shape users’ interactive narrative experiences.
Reinforcement Learning for Spatial Reasoning in Strategy Games
Leece, Michael A. (University of California, Santa Cruz) | Jhala, Arnav (University of California, Santa Cruz)
One of the major weaknesses of current real-time strategy (RTS) game agents is handling spatial reasoning at a high level. One challenge in developing spatial reasoning modules for RTS agents is to evaluate the ability of a given agent for this competency due to the inevitable confounding factors created by the complexity of these agents. We propose a simplified game that mimics spatial reasoning aspects of more complex games, while removing other complexities. Within this framework, we analyze the effectiveness of classical reinforcement learning for spatial management in order to build a detailed evaluative standard across a broad set of opponent strategies. We show that against a suite of opponents with fixed strategies, basic Q-learning is able to learn strategies to beat each. In addition, we demonstrate that performance against unseen strategies improves with prior training from other distinct strategies. We also test a modification of the basic algorithm to include multiple actors, to speed learning and increase scalability. Finally, we discuss the potential for knowledge transfer to more complex games with similar components.