Goto

Collaborating Authors

 td learning




Gaussian Approximation and Multiplier Bootstrap for Polyak-Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning

Neural Information Processing Systems

In this paper, we obtain the Berryโ€“Esseen bound for multivariate normal approximation for the Polyak-Ruppert averaged iterates of the linear stochastic approximation (LSA) algorithm with decreasing step size. Moreover, we prove the non-asymptotic validity of the confidence intervals for parameter estimation with LSA based on multiplier bootstrap. This procedure updates the LSA estimate together with a set of randomly perturbed LSA estimates upon the arrival of subsequent observations. We illustrate our findings in the setting of temporal difference learning with linear function approximation.


PID Accelerated Temporal Difference Algorithms

arXiv.org Machine Learning

Long-horizon tasks, which have a large discount factor, pose a challenge for most conventional reinforcement learning (RL) algorithms. Algorithms such as Value Iteration and Temporal Difference (TD) learning have a slow convergence rate and become inefficient in these tasks. When the transition distributions are given, PID VI was recently introduced to accelerate the convergence of Value Iteration using ideas from control theory. Inspired by this, we introduce PID TD Learning and PID Q-Learning algorithms for the RL setting in which only samples from the environment are available. We give theoretical analysis of their convergence and acceleration compared to their traditional counterparts. We also introduce a method for adapting PID gains in the presence of noise and empirically verify its effectiveness.


One-Shot Averaging for Distributed TD($\lambda$) Under Markov Sampling

arXiv.org Artificial Intelligence

Actor-critic method achieves state-of-the-art performance in many domains including robotics, game playing, and control systems (LeCun et al. (2015); Mnih et al. (2016); Silver et al. (2017)). Temporal Difference (TD) Learning may be thought of as a component of actor critic, and better bounds for TD Learning are usually ingredients of actor-critic analyses. We consider the problem of policy evaluation in reinforcement learning: given a Markov Decision Process (MDP) and a policy, we need to estimate the value of each state (expected discounted sum of all future rewards) under this policy. Policy evaluation is important because it is effectively a subroutine of many other algorithms such as policy iteration and actor-critic. The main challenges for policy evaluation are that we usually do not know the underlying MDP directly and can only interact with it, and that the number of states is typically too large forcing us to maintain a low-dimensional approximation of the true vector of state values.


Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards

arXiv.org Artificial Intelligence

In a broad class of reinforcement learning applications, stochastic rewards have heavy-tailed distributions, which lead to infinite second-order moments for stochastic (semi)gradients in policy evaluation and direct policy optimization. In such instances, the existing RL methods may fail miserably due to frequent statistical outliers. In this work, we establish that temporal difference (TD) learning with a dynamic gradient clipping mechanism, and correspondingly operated natural actor-critic (NAC), can be provably robustified against heavy-tailed reward distributions. It is shown in the framework of linear function approximation that a favorable tradeoff between bias and variability of the stochastic gradients can be achieved with this dynamic gradient clipping mechanism. In particular, we prove that robust versions of TD learning achieve sample complexities of order $\mathcal{O}(\varepsilon^{-\frac{1}{p}})$ and $\mathcal{O}(\varepsilon^{-1-\frac{1}{p}})$ with and without the full-rank assumption on the feature matrix, respectively, under heavy-tailed rewards with finite moments of order $(1+p)$ for some $p\in(0,1]$, both in expectation and with high probability. We show that a robust variant of NAC based on Robust TD learning achieves $\tilde{\mathcal{O}}(\varepsilon^{-4-\frac{2}{p}})$ sample complexity. We corroborate our theoretical results with numerical experiments.


Learning sparse representations in reinforcement learning

arXiv.org Artificial Intelligence

Jacob Rafati, David C. Noelle Electrical Engineering and Computer Scinence Computational Cognitive Neuroscience Laboratory University of California, Merced 5200 North Lake Road, Merced, CA 95343 USA.Abstract Reinforcement learning (RL) algorithms allow artificial agents to improve their selection of actions to increase rewarding experiences in their environments. Temporal Di ff erence (TD) Learning - a model-free RL method - is a leading account of the midbrain dopamine system and the basal ganglia in reinforcement learning. These algorithms typically learn a mapping from the agent's current sensed state to a selected action (known as a policy function) via learning a value function (expected future rewards). TD Learning methods have been very successful on a broad range of control tasks, but learning can become intractably slow as the state space of the environment grows. This has motivated methods that learn internal representations of the agent's state, e ffectively reducing the size of the state space and restructuring state representations in order to support generalization. However, TD Learning coupled with an artificial neural network, as a function approximator, has been shown to fail to learn some fairly simple control tasks, challenging this explanation of reward-based learning. We hypothesize that such failures do not arise in the brain because of the ubiquitous presence of lateral inhibition in the cortex, producing sparse distributed internal representations that support the learning of expected future reward. The sparse conjunctive representations can avoid catastrophic interference while still supporting generalization. We provide support for this conjecture through computational simulations, demonstrating the benefits of learned sparse representations for three problematic classic control tasks: Puddle-world, Mountain-car, and Acrobot. Introduction Reinforcement learning (RL) - a class of machine learning problems - is learning how to map situations to actions so as to maximize numerical reward signals received during the experiences that an artificial agent has as it interacts with its environment (Sutton and Barto, 1998). The agent may also be seen as having a goal (or goals) related to the state of the environment. Humans and nonhuman animals' capability of learning highly complex skills by reinforcing appropriate behaviors with reward and the role of midbrain dopamine system in reward-based learning has been well described by a class of a model-free RL, called T emporal Difference (TD) Learning (Montague et al., 1996; Schultz et al., 1997). While TD Learning, by itself, certainly does not explain all observed RL phenomena, increasing evidence suggests that it is key to the brain's adaptive nature (Dayan and Niv, 2008). One of the challenges that arise in RL in real-world problems is that the state space can be very large.


Approximate Kalman Filter Q-Learning for Continuous State-Space MDPs

arXiv.org Machine Learning

We seek to learn an effective policy for a Markov Decision Process (MDP) with continuous states via Q-Learning. Given a set of basis functions over state action pairs we search for a corresponding set of linear weights that minimizes the mean Bellman residual. Our algorithm uses a Kalman filter model to estimate those weights and we have developed a simpler approximate Kalman filter model that outperforms the current state of the art projected TD-Learning methods on several standard benchmark problems.