New Versions of Gradient Temporal Difference Learning

Lee, Donghwan, Lim, Han-Dong, Park, Jihoon, Choi, Okyong

arXiv.org Artificial Intelligence 

Temporal-difference (TD) learning [1] is one of the most popular reinforcement learning (RL) algorithms [2] for policy evaluation problems. However, its main limitation lies in its inability to accommodate both off-policy learning and linear function approximation with convergence guarantees, which has been an important open problem for decades. In 2009, Sutton, Szepesvári, and Maei [3], [4] introduced the first TD learning algorithms compatible with both linear function approximation and off-policy training based on gradient estimations, which are thus called gradient temporal-difference learning (GTD).

The standard ODE-based approach does not allow a general and formal analysis framework because the asymptotic stability of the ODE model depends significantly on the specific algorithm, and it is in general hard to establish the stability of the ODE model. On the other hand, the proposed analysis applies the recent asymptotic stability theory of primal-dual gradient dynamics (PDGD) [13], in which control-theoretic frameworks for the stability analysis of PDGD are developed. Using this recent result, we provide a new template for the analysis of GTDs.
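To make the kind of update rule discussed above concrete, the following is a minimal sketch of the GTD2 variant with linear function approximation and importance-sampling weights for off-policy training; it can be read as a stochastic discretization of primal-dual gradient dynamics on a convex-concave objective. The step sizes, feature dimension, and the toy data loop are illustrative assumptions, not the paper's algorithms or experimental setup.

```python
# A minimal sketch of a GTD2-style update (Sutton, Szepesvari and Maei, 2009)
# for off-policy policy evaluation with linear function approximation.
# Step sizes, feature dimension, and the random "environment" below are
# assumptions for illustration only.
import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward, rho,
              gamma=0.99, alpha=0.01, beta=0.05):
    """One GTD2 update from a single off-policy transition.

    theta    : primal weights of the linear value estimate V(s) ~ phi(s)^T theta
    w        : auxiliary (dual) weights estimating E[phi phi^T]^{-1} E[rho*delta*phi]
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio pi(a|s) / b(a|s)
    """
    delta = reward + gamma * (phi_next @ theta) - phi @ theta   # TD error
    theta = theta + alpha * rho * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * rho * (delta - phi @ w) * phi
    return theta, w

# Toy usage: random features and transitions, just to exercise the update.
rng = np.random.default_rng(0)
n_features = 8
theta = np.zeros(n_features)
w = np.zeros(n_features)
for _ in range(1000):
    phi = rng.normal(size=n_features)
    phi_next = rng.normal(size=n_features)
    reward = rng.normal()
    rho = rng.uniform(0.5, 1.5)   # stand-in importance weight
    theta, w = gtd2_step(theta, w, phi, phi_next, reward, rho)
```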