Reviews: Finite sample analysis of the GTD Policy Evaluation Algorithms in Markov Setting

Neural Information Processing Systems 

It is well known that the standard TD algorithm widely used in reinforcement learning does not correspond to the gradient of any objective function, and consequently can be unstable when combined with function approximation. Despite the empirical success of methods such as deep RL, which combine vanilla TD with deep learning, TD with nonlinear function approximation is demonstrably unstable in theory. Much work on fixing this fundamental flaw in RL was in vain until the gradient TD methods of Sutton et al. Unfortunately, while these methods work, their original analysis was flawed, as it rested on a heuristic derivation. A recent breakthrough by Liu et al. (UAI 2015) showed that gradient TD methods are essentially saddle-point methods: they are pure gradient methods, but ones that optimize not the original TD loss function (which, despite their name, they do not minimize directly) but rather the saddle-point loss function that arises when the original loss is converted into its dual form.
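
To make the saddle-point connection concrete, here is a minimal sketch of the reformulation, assuming the standard linear MSPBE setting of Liu et al. (2015); the symbols $A$, $b$, $C$, $\phi$, $\phi'$ and the step sizes $\alpha_t$, $\beta_t$ below are the conventional notation for that setting, not symbols taken from the paper under review:
\[
\mathrm{MSPBE}(\theta) \;=\; \tfrac{1}{2}\,\lVert A\theta - b\rVert_{C^{-1}}^{2},
\qquad
A = \mathbb{E}\big[\phi\,(\phi - \gamma\phi')^{\top}\big],\quad
b = \mathbb{E}[r\,\phi],\quad
C = \mathbb{E}\big[\phi\,\phi^{\top}\big].
\]
By Fenchel duality ($\tfrac{1}{2}\lVert x\rVert_{C^{-1}}^{2} = \max_{w}\,\langle x, w\rangle - \tfrac{1}{2}\lVert w\rVert_{C}^{2}$), minimizing the MSPBE is equivalent to the saddle-point problem
\[
\min_{\theta}\,\max_{w}\; \langle b - A\theta,\; w\rangle \;-\; \tfrac{1}{2}\,\lVert w\rVert_{C}^{2},
\]
whose stochastic primal-dual gradient iterates, written with the TD error $\delta_t = r_t + \gamma\,\theta_t^{\top}\phi'_t - \theta_t^{\top}\phi_t$,
\[
w_{t+1} = w_t + \beta_t\,\big(\delta_t - \phi_t^{\top} w_t\big)\,\phi_t,
\qquad
\theta_{t+1} = \theta_t + \alpha_t\,\big(\phi_t - \gamma\,\phi'_t\big)\big(\phi_t^{\top} w_t\big),
\]
recover the GTD2 updates. This saddle-point view is what enabled Liu et al. to bring the machinery of convex-concave optimization to bear and obtain finite-sample bounds.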