which, to the best of our knowledge, is not covered by existing work.

TD and SGD: Our unified analysis covers both TD and SGD, where TD is not covered by any existing supervised

Neural Information Processing Systems 

We appreciate the valuable comments from the reviewers.

In contrast, in supervised learning, such a matrix is the Hessian, which must be symmetric. A "straightforward adaptation" of existing supervised learning analysis does not yield the global convergence of TD.

Nonconvex mirror descent: Most existing analysis of mirror descent's convergence to a global optimum builds

Error propagation: RL is divided into policy-based and value-based approaches. In particular, the Q-function tracked in Q-learning is not the action-value function of any policy.
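To make the symmetry contrast between TD and supervised learning concrete, the following is a minimal numerical sketch and not the setting analyzed in the paper. It assumes linear function approximation, where the expected TD(0) update is governed by the standard matrix A = Phi^T D (I - gamma P) Phi, and compares it with the Hessian Phi^T D Phi of a weighted least-squares objective. The 3-state chain, the feature matrix `phi`, and the helper names `td_matrix` and `least_squares_hessian` are hypothetical and chosen only for illustration.

```python
import numpy as np


def td_matrix(phi, P, d, gamma):
    """Matrix of the expected linear TD(0) update: Phi^T D (I - gamma P) Phi."""
    D = np.diag(d)
    n_states = P.shape[0]
    return phi.T @ D @ (np.eye(n_states) - gamma * P) @ phi


def least_squares_hessian(phi, d):
    """Hessian of the weighted least-squares loss 0.5 * sum_s d_s (phi_s^T w - y_s)^2."""
    return phi.T @ np.diag(d) @ phi


def is_symmetric(M):
    """Numerical symmetry check."""
    return np.allclose(M, M.T)


# Hypothetical 3-state cyclic chain: 0 -> 1 -> 2 -> 0 (assumption, for illustration only).
P = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])
d = np.full(3, 1.0 / 3.0)  # stationary distribution of the cyclic chain
phi = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])  # hypothetical two-dimensional features, one row per state
gamma = 0.9

A = td_matrix(phi, P, d, gamma)
H = least_squares_hessian(phi, d)

print("TD matrix A:\n", A)
print("A symmetric?", is_symmetric(A))
print("Least-squares Hessian H:\n", H)
print("H symmetric?", is_symmetric(H))
```

In this toy instance the off-diagonal entries of A differ because the transition matrix P enters only on one side of the product, whereas H is symmetric by construction. This illustrates the structural difference only; it says nothing about the overparameterized neural setting.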