Reinforcement Learning
On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor
Kolodziejski, Christoph, Porr, Bernd, Tamosiunaite, Minija, Wörgötter, Florentin
In this theoretical contribution we provide mathematical proof that two of the most important classes of network learning - correlation-based differential Hebbian learningand reward-based temporal difference learning - are asymptotically equivalent when timing the learning with a local modulatory signal. This opens the opportunity to consistently reformulate most of the abstract reinforcement learning frameworkfrom a correlation based perspective that is more closely related to the biophysics of neurons.
Regularized Policy Iteration
Farahmand, Amir M., Ghavamzadeh, Mohammad, Mannor, Shie, Szepesvári, Csaba
In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2-regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementation for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions.
Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
Castro, Dotan D., Volkinshtein, Dmitry, Meir, Ron
Actor-critic algorithms for reinforcement learning are achieving renewed popularity dueto their good convergence properties in situations where other approaches often fail (e.g., when function approximation is involved). Interestingly, there is growing evidence that actor-critic approaches based on phasic dopamine signals play a key role in biological learning through cortical and basal ganglia loops. We derive a temporal difference based actor critic learning algorithm, for which convergence can be proved without assuming widely separated time scales for the actor and the critic. The approach is demonstrated by applying it to networks of spiking neurons. The established relation between phasic dopamine and the temporal difference signal lends support to the biological relevance of such algorithms.
Near-optimal Regret Bounds for Reinforcement Learning
Auer, Peter, Jaksch, Thomas, Ortner, Ronald
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s1,s2 there is a policy which moves from s1 to s2 in at most D steps (on average). We present a reinforcement learning algorithm with total regret O(DSAT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Omega(DSAT) on the total regret of any learning algorithm. Both bounds demonstrate the utility of the diameter as structural parameter of the MDP.
Multi-Step Dyna Planning for Policy Evaluation and Control
Yao, Hengshuai, Bhatnagar, Shalabh, Diao, Dongcui, Sutton, Richard S., Szepesvári, Csaba
We extend Dyna planning architecture for policy evaluation and control in two significant aspects. First, we introduce a multi-step Dyna planning that projects the simulated state/feature many steps into the future. Our multi-step Dyna is based on a multi-step model, which we call the {\em $\lambda$-model}. The $\lambda$-model interpolates between the one-step model and an infinite-step model, and can be learned efficiently online. Second, we use for Dyna control a dynamic multi-step model that is able to predict the results of a sequence of greedy actions and track the optimal policy in the long run. Experimental results show that Dyna using the multi-step model evaluates a policy faster than using single-step models; Dyna control algorithms using the dynamic tracking model are much faster than model-free algorithms; further, multi-step Dyna control algorithms enable the policy and value function to converge much faster to their optima than single-step Dyna algorithms.
Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
Rohanimanesh, Khashayar, Singh, Sameer, McCallum, Andrew, Black, Michael J.
Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these sampling-based methods suffer from local minima|the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed|we bypass the need to compute marginals entirely. Our method provides dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48\% reduction in error over state-of-the-art in that domain.
A Generalized Natural Actor-Critic Algorithm
Morimura, Tetsuro, Uchibe, Eiji, Yoshimoto, Junichiro, Doya, Kenji
Policy gradient Reinforcement Learning (RL) algorithms have received substantial attention,seeking stochastic policies that maximize the average (or discounted cumulative) reward. In addition, extensions based on the concept of the Natural Gradient (NG) show promising learning efficiency because these regard metrics for the task. Though there are two candidate metrics, Kakade's Fisher Information Matrix (FIM) for the policy (action) distribution and Morimura's FIM for the stateaction jointdistribution, but all RL algorithms with NG have followed Kakade's approach. In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs and propose an efficient implementation for the gNG learning based on a theory of the estimating function, the generalized Natural Actor-Critic(gNAC) algorithm. The gNAC algorithm involves a near optimal auxiliary function to reduce the variance of the gNG estimates. Interestingly, the gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1], as long as the interpolating parameter is appropriately selected. Numerical experimentsshowed that the proposed gNAC algorithm can estimate gNG efficiently and outperformed the NAC algorithm.
Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
Bhatnagar, Shalabh, Precup, Doina, Silver, David, Sutton, Richard S., Maei, Hamid R., Szepesvári, Csaba
We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD($\lambda$), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al (2009a,b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman-error, and algorithms that perform stochastic gradient-descent on this function. In this paper, we generalize their work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms for any finite Markov decision process and any smooth value function approximator, under usual stochastic approximation conditions. The computational complexity per iteration scales linearly with the number of parameters of the approximator. The algorithms are incremental and are guaranteed to converge to locally optimal solutions.
Solving Stochastic Games
Dermed, Liam M., Isbell, Charles L.
Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games to within $\epsilon$ relative error of the optimal game-theoretic solution, in time polynomial in $1/\epsilon$. Our algorithm extends Murrays and Gordon’s (2007) modified Bellman equation which determines the \emph{set} of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts.