Wu, Weichen
Uncertainty quantification for Markov chains with application to temporal difference learning
Wu, Weichen, Wei, Yuting, Rinaldo, Alessandro
Markov chains are fundamental to statistical machine learning, underpinning key methodologies such as Markov Chain Monte Carlo (MCMC) sampling and temporal difference (TD) learning in reinforcement learning (RL). Given their widespread use, it is crucial to establish rigorous probabilistic guarantees on their convergence, uncertainty, and stability. In this work, we develop novel, high-dimensional concentration inequalities and Berry-Esseen bounds for vector- and matrix-valued functions of Markov chains, addressing key limitations in existing theoretical tools for handling dependent data. We leverage these results to analyze the TD learning algorithm, a widely used method for policy evaluation in RL. Our analysis yields a sharp high-probability consistency guarantee that matches the asymptotic variance up to logarithmic factors. Furthermore, we establish an $O(T^{-\frac{1}{4}}\log T)$ distributional convergence rate for the Gaussian approximation of the TD estimator, measured in convex distance. These findings provide new insights into statistical inference for RL algorithms, bridging the gap between classical stochastic approximation theory and modern reinforcement learning applications.
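For intuition, a Gaussian approximation in convex distance of the kind described above can be rendered schematically as follows. The notation is illustrative rather than the paper's own: $\bar\theta_T$ stands for the (averaged) TD estimator, $\theta^\star$ for its target, and $\Lambda$ for the asymptotic covariance.

```latex
% Schematic Berry-Esseen-type bound in convex distance (illustrative notation):
% \bar\theta_T = averaged TD estimator, \theta^\star = its limit,
% \Lambda = asymptotic covariance of the rescaled estimator.
\sup_{C \subseteq \mathbb{R}^d \ \text{convex}}
  \left|\,
    \mathbb{P}\!\left( \sqrt{T}\,(\bar\theta_T - \theta^\star) \in C \right)
    - \mathbb{P}\!\left( \mathcal{N}(0, \Lambda) \in C \right)
  \right|
  \;\lesssim\; \frac{\log T}{T^{1/4}}.
```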
Statistical Inference for Temporal Difference Learning with Linear Function Approximation
Wu, Weichen, Li, Gen, Wei, Yuting, Rinaldo, Alessandro
Statistical inference tasks, such as constructing confidence regions or simultaneous confidence intervals, are often addressed by deriving distributional theory, such as central limit theorems (CLTs), for the estimator in use. Due to the high dimensionality of modern science and engineering applications, there has been a surge of interest in deriving convergence results that are valid in a finite-sample manner. In Reinforcement Learning (RL), a discipline that underpins many recent machine learning breakthroughs (Agarwal et al. (2019); Sutton and Barto (2018)), one central question is how to evaluate with confidence the quality of a given policy, measured by its value function. As RL is often modeled as decision making in Markov decision processes (MDPs), the task of statistical inference needs to accommodate the online nature of the sampling mechanism. Temporal Difference (TD) learning (Sutton (1988)) is arguably the most widely used algorithm designed for this purpose. TD learning, an instance of the stochastic approximation (SA) method (Robbins and Monro (1951)), approximates the value function of a given policy in an iterative manner. Towards understanding the non-asymptotic convergence rate of TD to the target value function, extensive recent efforts have been made (see, e.g.
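To make the iterative scheme referred to above concrete, here is a minimal sketch of TD(0) with linear function approximation under online (Markovian) sampling. The environment and policy interfaces (`env.reset`, `env.step`, `policy`, the feature map `phi`) follow a Gym-like convention and are stand-ins, and the step-size schedule is just one common choice, not the specification used in the paper.

```python
import numpy as np

def td0_linear(env, policy, phi, dim, gamma=0.99, T=10_000,
               step=lambda t: 1.0 / (t + 1) ** 0.75):
    """Minimal TD(0) sketch with linear value approximation V(s) ~= phi(s) @ theta.

    `env`, `policy`, and `phi` are hypothetical interfaces: env.reset()/env.step(a)
    follow a Gym-like convention, policy(s) returns an action, and phi(s) returns a
    length-`dim` feature vector. Returns both the final iterate and its
    Polyak-Ruppert average, the quantity typically studied in CLT-type analyses.
    """
    theta = np.zeros(dim)
    theta_bar = np.zeros(dim)              # running Polyak-Ruppert average
    s = env.reset()
    for t in range(T):
        a = policy(s)
        s_next, r, done, _ = env.step(a)
        # TD(0) semi-gradient step: move theta along the TD-error direction.
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta = theta + step(t) * td_error * phi(s)
        theta_bar += (theta - theta_bar) / (t + 1)
        s = env.reset() if done else s_next
    return theta, theta_bar
```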
Sharp high-probability sample complexities for policy evaluation with linear function approximation
Li, Gen, Wu, Weichen, Chi, Yuejie, Ma, Cong, Rinaldo, Alessandro, Wei, Yuting
This paper is concerned with the problem of policy evaluation with linear function approximation in discounted infinite horizon Markov decision processes. We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely used policy evaluation algorithms: the temporal difference (TD) learning algorithm and the two-timescale linear TD with gradient correction (TDC) algorithm. In both the on-policy setting, where observations are generated from the target policy, and the off-policy setting, where samples are drawn from a behavior policy potentially different from the target policy, we establish the first sample complexity bound with a high-probability convergence guarantee that attains the optimal dependence on the tolerance level. We also exhibit an explicit dependence on problem-related quantities, and show in the on-policy setting that our upper bound matches the minimax lower bound on crucial problem parameters, including the choice of the feature maps and the problem dimension.
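For concreteness, below is a minimal sketch of a single two-timescale TDC step with linear features, written to cover both settings: the importance ratio `rho` weights the updates in the off-policy case, and setting `rho = 1.0` recovers the on-policy update. Variable names and the two step sizes are illustrative choices, not the paper's specification.

```python
import numpy as np

def tdc_update(theta, w, phi_s, phi_next, reward, rho, gamma, alpha, beta):
    """One two-timescale TDC (gradient-corrected TD) step with linear features.

    theta: main parameter (slow timescale, step size alpha)
    w:     auxiliary parameter (fast timescale, step size beta)
    rho:   importance ratio pi(a|s) / b(a|s); use rho = 1.0 in the on-policy case.
    Follows the standard TDC recursion; the weighting and step sizes here are
    illustrative, not the configuration analyzed in the paper.
    """
    td_error = reward + gamma * phi_next @ theta - phi_s @ theta
    # Slow update: TD direction minus the gradient-correction term.
    theta = theta + alpha * rho * (td_error * phi_s - gamma * (phi_s @ w) * phi_next)
    # Fast update: track the projection of the TD error onto the features.
    w = w + beta * rho * (td_error - phi_s @ w) * phi_s
    return theta, w
```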