SupplementaryMaterials AProofofTheorem2: AsymptoticConvergenceofRobustQ-Learning

Neural Information Processing Systems 

From[BorkarandMeyn,2000],weknowthatthestochastic approximation (18) converges to the fixed point ofT, i.e., Q . Finally, to show Theorem 3, we only need to show each term in(56) is smaller than . In this section we develop the finite-time analysis of the robust TDC algorithm. We note that recently there are several works [Srikant and Ying, 2019, Xu and Liang, 2021, Kaledin et al., 2020] on finite-time analysis of RL algorithms that do not need theprojection. Specifically, the problem in [Srikant and Ying, 2019] is for one time scalelinear stochastic approximation.