Statistical Inference for Temporal Difference Learning with Linear Function Approximation
Weichen Wu, Gen Li, Yuting Wei, Alessandro Rinaldo
Statistical inference tasks, such as constructing confidence regions or simultaneous confidence intervals, are often addressed by deriving distributional theory, such as central limit theorems (CLTs), for the estimator in use. Due to the high dimensionality of modern science and engineering applications, there has been a surge of interest in deriving convergence results that are valid in a finite-sample manner. In Reinforcement Learning (RL), a discipline that underpins many recent machine learning breakthroughs (Agarwal et al. (2019); Sutton and Barto (2018)), one central question is to evaluate with confidence the quality of a given policy, measured by its value function. As RL is often modeled as decision making in Markov decision processes (MDPs), the task of statistical inference needs to accommodate the online nature of the sampling mechanism. Temporal Difference (TD) learning (Sutton (1988)) is arguably the most widely used algorithm designed for this purpose. TD learning, which is an instance of the stochastic approximation (SA) method (Robbins and Monro (1951)), approximates the value function of a given policy in an iterative manner. Towards understanding the non-asymptotic convergence rate of TD to the target value function, extensive recent efforts have been made (see, e.g.
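For concreteness, the following is a minimal sketch of the TD(0) update with linear function approximation that the abstract refers to. The environment interface (`reset`, `step`, `sample_action`), the feature map `phi`, and the step-size choice are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Minimal sketch of TD(0) with linear function approximation for policy
# evaluation. The environment interface and feature map are hypothetical.

def td0_linear(env, phi, dim, num_steps, gamma=0.99, alpha=0.01):
    """Iteratively estimate theta so that V_pi(s) is approximated by phi(s) @ theta."""
    theta = np.zeros(dim)
    s = env.reset()
    for _ in range(num_steps):
        a = env.sample_action(s)          # action drawn from the fixed policy pi
        s_next, r, done = env.step(a)
        # Semi-gradient TD(0) step: move theta toward the bootstrapped target
        # r + gamma * phi(s') @ theta, i.e., one stochastic-approximation update.
        target = r + (0.0 if done else gamma * phi(s_next) @ theta)
        td_error = target - phi(s) @ theta
        theta += alpha * td_error * phi(s)
        s = env.reset() if done else s_next
    return theta
```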
Oct-21-2024