Reinforcement Learning
Scaling All-Goals Updates in Reinforcement Learning Using Convolutional Neural Networks
Pardo, Fabio, Levdik, Vitaly, Kormushev, Petar
Being able to reach any desired location in the environment can be a valuable asset for an agent. Learning a policy to navigate between all pairs of states individually is often not feasible. An all-goals updating algorithm uses each transition to learn Q-values towards all goals simultaneously and off-policy. However, the cost of performing so many updates in parallel has so far limited the approach to small tabular cases. To tackle this problem, we propose to use convolutional network architectures to generate Q-values and updates for a large number of goals at once. We demonstrate the accuracy and generalization qualities of the proposed method on randomly generated mazes and Sokoban puzzles. In the case of on-screen goal coordinates, the resulting mapping from frames to distance-maps directly informs the agent about which places are reachable and in how many steps. As an example application, we show that replacing the random actions in epsilon-greedy exploration with several actions towards feasible goals generates better exploratory trajectories on Montezuma's Revenge and Super Mario All-Stars games.
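To make the all-goals idea concrete, the following is a minimal tabular sketch, not the convolutional method proposed in the paper. It assumes that every state is also a possible goal, that the reward is 1 when the goal is reached (terminal for that goal) and 0 otherwise, and that `alpha` and `gamma` take illustrative values. The function name `all_goals_update` is hypothetical.

```python
def all_goals_update(Q, s, a, s_next, alpha=0.5, gamma=0.9):
    """One off-policy Q-learning update for every goal from a single transition.

    Q[g][s][a] estimates the discounted value of taking action a in state s
    when the goal is to reach state g. Assumed reward: 1 on reaching g
    (terminal for that goal), 0 otherwise.
    """
    for g in range(len(Q)):
        if s_next == g:
            target = 1.0                          # goal g reached by this transition
        else:
            target = gamma * max(Q[g][s_next])    # bootstrap towards goal g
        Q[g][s][a] += alpha * (target - Q[g][s][a])
    return Q


# Three states in a row; action 0 moves left, action 1 moves right.
n_states, n_actions = 3, 2
Q = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(n_states)]
all_goals_update(Q, s=1, a=1, s_next=2)   # teaches goal 2 from state 1
all_goals_update(Q, s=0, a=1, s_next=1)   # updates goals 1 and 2 from state 0
```

The loop over goals is what becomes expensive as the state space grows, which is the cost the abstract attributes to the tabular approach.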
Bootstrapping a DQN Replay Memory with Synthetic Experiences
von Pilchau, Wenzel Baron Pilar, Stein, Anthony, Hähner, Jörg
An important component of many Deep Reinforcement Learning algorithms is the Experience Replay, which serves as a storage mechanism or memory of past experiences. These experiences are used for training and help the agent to stably find the optimal trajectory through the problem space. The classic Experience Replay, however, uses only the experiences the agent actually collected, although the stored samples hold considerable potential in the form of knowledge about the problem that can be extracted. We present an algorithm that creates synthetic experiences in a nondeterministic discrete environment to assist the learner. The Interpolated Experience Replay is evaluated on the FrozenLake environment, and we show that it can help the agent learn faster and reach better performance than the classic version.
Finite Time Analysis of Linear Two-timescale Stochastic Approximation with Markovian Noise
Kaledin, Maxim, Moulines, Eric, Naumov, Alexey, Tadic, Vladislav, Wai, Hoi-To
The linear two-timescale stochastic approximation (SA) scheme is an important class of algorithms that has become popular in reinforcement learning (RL), particularly for the policy evaluation problem. Recently, a number of works have been devoted to establishing finite-time analyses of the scheme, especially under the Markovian (non-i.i.d.) noise settings that are ubiquitous in practice. In this paper, we provide a finite-time analysis for linear two-timescale SA. Our bounds show that there is no discrepancy in the convergence rate between Markovian and martingale noise; only the constants are affected by the mixing time of the Markov chain. With an appropriate step size schedule, the transient term in the expected error bound is $o(1/k^c)$ and the steady-state term is $O(1/k)$, where $c > 1$ and $k$ is the iteration number. Furthermore, we present an asymptotic expansion of the expected error with a matching lower bound of $\Omega(1/k)$. A simple numerical experiment is presented to support our theory.
Keywords: stochastic approximation, reinforcement learning, GTD learning, Markovian noise
1. Introduction
Since its introduction close to 70 years ago, the stochastic approximation (SA) scheme (Robbins and Monro, 1951) has been a powerful tool for root finding when only noisy samples are available. During the past two decades, considerable progress has been made in the practical and theoretical research of SA; see (Benaïm, 1999; Kushner and Yin, 2003; Borkar, 2008) for an overview. Among others, linear SA schemes are popular in reinforcement learning (RL) as they lead to policy evaluation methods with linear function approximation. Of particular importance is temporal difference (TD) learning (Sutton, 1988), for which finite-time analyses have been reported in (Srikant and Ying, 2019; Lakshminarayanan and Szepesvari, 2018; Bhandari et al., 2018; Dalal et al., 2018a).
The TD learning scheme based on classical (linear) SA is known to be inadequate for the off-policy learning paradigms in RL, where data samples are drawn from a behavior policy different from the policy being evaluated (Baird, 1995; Tsitsiklis and Van Roy, 1997). To circumvent this [...] These methods fall within the scope of the linear two-timescale SA scheme introduced by Borkar (1997):
$$\theta_{k+1} = \theta_k + \beta_k \{ \tilde{b}_1(X_{k+1}) - \tilde{A}_{11}(X_{k+1}) \theta_k - \tilde{A}_{12}(X_{k+1}) w_k \},$$
$$w_{k+1} = w_k + \gamma_k \{ \tilde{b}_2(X_{k+1}) - \tilde{A}_{21}(X_{k+1}) \theta_k - \tilde{A}_{22}(X_{k+1}) w_k \}. \qquad (1)$$
Footnote: Authors listed in alphabetical order.
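A small simulation may help illustrate a linear two-timescale recursion of this kind. The sketch below is not the paper's experiment. It assumes scalar iterates, the sign convention in which each update adds a step size times (b minus A-terms), a slow step size 1/(k+1) for theta and a faster step size (k+1)^(-2/3) for w, and a two-state Markov chain that perturbs both b terms. All numerical values and the function name `two_timescale_sa` are hypothetical.

```python
import random


def two_timescale_sa(n_iters=20000, seed=0):
    """Scalar linear two-timescale SA driven by a two-state Markov chain."""
    rng = random.Random(seed)
    # Hypothetical system: mean fields are b_i - A_i1 * theta - A_i2 * w.
    A11, A12, A21, A22 = 1.0, 0.5, 0.5, 1.0
    b1, b2 = 1.0, 1.0
    sigma, p_stay = 0.5, 0.7       # noise scale and stickiness of the chain
    x = 1                          # Markov chain state in {-1, +1}
    theta, w = 0.0, 0.0
    for k in range(n_iters):
        if rng.random() > p_stay:
            x = -x                 # move the chain to its next state
        beta = 1.0 / (k + 1)                 # slow step size (theta)
        gamma = 1.0 / (k + 1) ** (2 / 3)     # fast step size (w); beta/gamma -> 0
        noise = sigma * x                    # zero-mean, but correlated in time
        new_theta = theta + beta * (b1 + noise - A11 * theta - A12 * w)
        new_w = w + gamma * (b2 + noise - A21 * theta - A22 * w)
        theta, w = new_theta, new_w
    return theta, w


theta, w = two_timescale_sa()
```

For these assumed coefficients the noiseless fixed point solves theta + 0.5 w = 1 and 0.5 theta + w = 1, so both iterates should settle near 2/3.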
Rotation, Translation, and Cropping for Zero-Shot Generalization
Abstract: Deep Reinforcement Learning (DRL) has shown impressive performance on domains with visual inputs, in particular various games. However, the agent is usually trained on a fixed environment, e.g. a fixed number of levels. A growing mass of evidence suggests that these trained models fail to generalize to even slight variations of the environments they were trained on. This paper advances the hypothesis that the lack of generalization is partly due to the input representation, and explores how rotation, cropping and translation could increase generality. We show that a cropped, translated and rotated observation can yield better generalization on unseen levels of a two-dimensional arcade game. The generality of the agent is evaluated on a set of human-designed levels.
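One way such an input transformation could look on a tile grid is sketched below. It is an illustration under assumptions, not the paper's implementation: the observation is a list of rows, translation is done by centring a window on the agent, cropping by a fixed `radius`, and rotation by turning the window in 90-degree steps so that the agent always faces up. The function name `egocentric_view` and the padding value are hypothetical.

```python
def egocentric_view(grid, pos, facing, radius, pad=-1):
    """Crop a window centred on the agent and rotate it so the agent faces up.

    grid: list of rows; pos: (row, col); facing: 0=up, 1=right, 2=down, 3=left.
    Cells outside the grid are filled with `pad`.
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = pos
    window = []
    for r in range(r0 - radius, r0 + radius + 1):
        row = []
        for c in range(c0 - radius, c0 + radius + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(grid[r][c] if inside else pad)
        window.append(row)
    for _ in range(facing % 4):
        # One 90-degree counterclockwise turn of the window.
        window = [list(col) for col in zip(*window)][::-1]
    return window


# The agent (9) faces right and an item (5) lies directly in front of it.
grid = [
    [0, 0, 0],
    [0, 9, 5],
    [0, 0, 0],
]
view = egocentric_view(grid, pos=(1, 1), facing=1, radius=1)
```

After the transformation the item appears directly above the agent regardless of the original heading, which is the kind of invariance the abstract hypothesizes may help on unseen levels.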
Procedural Content Generation via Reinforcement Learning
Abstract: We investigate how reinforcement learning can be used to train level-designing agents. This represents a new approach to procedural content generation in games, where level design is framed as a game, and the content generator itself is learned. By seeing the design problem as a sequential task, we can use reinforcement learning to learn how to take the next action so that the expected final level quality is maximized. This approach can be used when few or no examples exist to train from, and the trained generator is very fast. We investigate three different ways of transforming two-dimensional level design problems into Markov decision processes and apply these to three game environments.
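To illustrate how level design can be framed as a sequential decision problem, here is a toy sketch. It is not one of the three formulations studied in the paper: it assumes a one-dimensional level, an action that sets a single tile, a made-up quality function, and a reward equal to the change in quality, so that the sum of rewards over an episode equals the final quality minus the initial quality. The class name `LevelDesignEnv` and all parameters are hypothetical.

```python
class LevelDesignEnv:
    """Toy level-design MDP: the state is the level, actions edit one tile."""

    def __init__(self, width=5, target_walls=2):
        self.width = width
        self.target_walls = target_walls
        self.level = [0] * width          # 0 = empty, 1 = wall

    def quality(self):
        # Hypothetical quality: closer to the target wall count is better.
        return -abs(sum(self.level) - self.target_walls)

    def step(self, action):
        position, tile = action           # which tile to set, and to what
        before = self.quality()
        self.level[position] = tile
        reward = self.quality() - before  # improvement in level quality
        done = self.quality() == 0        # stop once the target is met
        return list(self.level), reward, done


env = LevelDesignEnv()
state, reward, done = env.step((0, 1))    # place a wall at position 0
state, reward, done = env.step((3, 1))    # place a wall at position 3
```

Because the rewards telescope, an agent that maximizes return in this toy setting also maximizes the quality of the final level.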
Neuro-evolutionary Frameworks for Generalized Learning Agents
The ultimate aim of artificial intelligence research is to develop agents with truly intelligent behaviors, akin to those found in humans and animals. To this end, a number of tools and techniques have been developed. In recent years, two approaches in particular, deep learning (DL) and reinforcement learning (RL), seem to have made considerable progress towards this goal. Both fields have been widely studied, with numerous successful examples [22, 29, 42, 25, 40] reported, particularly in recent years. However, even with the unprecedented success of recent approaches such as deep RL [28, 27, 36], poor sample efficiency and limited generalization remain major concerns to be addressed, keeping in view the ultimate goal of developing general-purpose agents. The poor generalization capability of DL is exposed by its susceptibility to deception when presented with adversarial examples [30, 39]. Recent work [38] showed that it was possible to hurt the performance of DL-based image recognition systems by carefully altering just a single pixel.
Bridging the Gap: Providing Post-Hoc Symbolic Explanations for Sequential Decision-Making Problems with Black Box Simulators
Sreedharan, Sarath, Soni, Utkash, Verma, Mudit, Srivastava, Siddharth, Kambhampati, Subbarao
As more and more complex AI systems are introduced into our day-to-day lives, it becomes important that everyday users can work and interact with such systems with relative ease. Orchestrating such interactions requires the system to be capable of providing explanations and rationale for its decisions and to be able to field queries about alternative decisions. A significant hurdle to allowing for such explanatory dialogue could be the mismatch between the complex representations that the systems use to reason about the task and the terms in which the user may be viewing the task. This paper introduces methods that can be leveraged to provide contrastive explanations in terms of user-specified concepts for deterministic sequential decision-making settings where the system dynamics may be best represented in terms of black box simulators. We do this by assuming that system dynamics can at least be partly captured in terms of symbolic planning models, and we provide explanations in terms of these models. We implement this method using a simulator for a popular Atari game (Montezuma's Revenge) and perform user studies to verify whether people would find explanations generated in this form useful.
Finite-Sample Analysis of Stochastic Approximation Using Smooth Convex Envelopes
Chen, Zaiwei, Maguluri, Siva Theja, Shakkottai, Sanjay, Shanmugam, Karthikeyan
Stochastic Approximation (SA) is a popular approach for solving fixed point equations where the information is corrupted by noise. In this paper, we consider an SA involving a contraction mapping with respect to an arbitrary norm, and establish finite-sample bounds for both constant and diminishing step sizes. The idea is to construct a smooth Lyapunov function using the generalized Moreau envelope, and show that the iterates of SA are contracting in expectation with respect to that Lyapunov function. The result is applicable to various Reinforcement Learning (RL) algorithms. In particular, we use it to establish the first-known convergence rate of the V-trace algorithm for off-policy TD-Learning [15], and improve the existing bound for the tabular $Q$-Learning algorithm. Further, for these two applications, our construction of the Lyapunov functions results in only a logarithmic dependence of the convergence bound on the state-space dimension.
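The kind of iteration being analyzed can be sketched as follows. This shows only a noisy fixed-point iteration for a contraction, not the Lyapunov construction or the bounds from the paper. It assumes the update x <- x + eps_k * (F(x) - x + noise), a hypothetical two-state Bellman-style operator F that is a contraction in the sup norm with factor 0.5, Gaussian noise, and a diminishing step size 2/(k+2). The function names and all numbers are illustrative.

```python
import random


def bellman(v, gamma=0.5):
    """F(v) = r + gamma * P v for a hypothetical two-state chain."""
    r = [1.0, 0.0]
    P = [[0.5, 0.5], [0.5, 0.5]]
    return [r[i] + gamma * sum(P[i][j] * v[j] for j in range(2)) for i in range(2)]


def stochastic_approximation(n_iters=20000, sigma=0.5, seed=0):
    """Noisy fixed-point iteration x <- x + eps * (F(x) - x + noise)."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for k in range(n_iters):
        eps = 2.0 / (k + 2)        # diminishing step size
        fx = bellman(x)
        x = [x[i] + eps * (fx[i] - x[i] + rng.gauss(0.0, sigma)) for i in range(2)]
    return x


x = stochastic_approximation()
```

For this assumed operator the exact fixed point is (1.5, 0.5), so the iterates can be compared against it to observe how the error shrinks as the step size decays.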
Deep Reinforcement Learning for Autonomous Driving: A Survey
Kiran, B Ravi, Sobh, Ibrahim, Talpaert, Victor, Mannion, Patrick, Sallab, Ahmad A. Al, Yogamani, Senthil, Pérez, Patrick
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high-dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms, provides a taxonomy of automated driving tasks where (D)RL methods have been employed, highlights the key algorithmic and deployment challenges for real-world autonomous driving agents, discusses the role of simulators in training agents, and finally reviews methods to evaluate, test and robustify existing solutions in RL and imitation learning.
Integrating Deep Reinforcement Learning with Model-based Path Planners for Automated Driving
Yurtsever, Ekim, Capito, Linda, Redmill, Keith, Ozguner, Umit
Automated driving in urban settings is challenging chiefly due to the nondeterministic behavior of human traffic participants. These behaviors are difficult to model, and conventional, rule-based Automated Driving Systems (ADSs) tend to fail when they face unmodeled dynamics. On the other hand, the more recent, end-to-end Deep Reinforcement Learning (DRL) based ADSs have shown promising results. However, pure learning-based approaches lack the hard-coded safety measures of model-based methods. Here we propose a hybrid approach that integrates a model-based path planner into a vision-based DRL framework to alleviate the shortcomings of both worlds. In summary, the DRL agent learns to overrule the model-based planner's decisions if it predicts that better future rewards can be obtained by doing so, e.g., by avoiding an accident. Otherwise, the DRL agent tends to follow the model-based planner as closely as possible. This logic is learned, i.e., no switching model is designed here. The agent learns this by considering two penalties: the penalty of straying away from the model-based path planner and the penalty of having a collision. The latter takes precedence over the former, i.e., the collision penalty is greater. Therefore, after training, the agent learns to follow the model-based planner when it is safe to do so, since it is penalized otherwise. However, it also learns to sacrifice the positive rewards for following the model-based planner in order to avoid a potentially large penalty for a future collision. Experimental results show that the proposed method can plan its path and navigate while avoiding obstacles between randomly chosen origin-destination points in CARLA, a dynamic urban simulation environment. Our code is open-source and available online.
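The two-penalty idea can be sketched as a reward function. This is an illustration under assumptions rather than the reward used in the paper: deviation is measured as the Euclidean distance between the agent and the nearest planner waypoint, and the weights are made-up values chosen only so that a collision costs far more than any plausible deviation. The function name `driving_reward` and its parameters are hypothetical.

```python
import math


def driving_reward(agent_xy, planner_xy, collided,
                   deviation_weight=0.1, collision_penalty=100.0):
    """Penalize straying from the planner's path, and penalize collisions more."""
    deviation = math.dist(agent_xy, planner_xy)   # distance to the planned waypoint
    reward = -deviation_weight * deviation
    if collided:
        reward -= collision_penalty               # dominates the deviation term
    return reward


# Leaving the planned path by 5 m is far cheaper than colliding while on it.
off_path = driving_reward((0.0, 0.0), (3.0, 4.0), collided=False)
crash = driving_reward((0.0, 0.0), (0.0, 0.0), collided=True)
```

With the collision term dominating, an agent maximizing this reward would be expected to follow the planner when that is safe and to deviate when following it would lead to a crash.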