Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting
Yue Wang, Wei Chen, Yuting Liu, Zhi-Ming Ma, Tie-Yan Liu
Neural Information Processing Systems
In reinforcement learning (RL), one of the key components is policy evaluation, which aims to estimate the value function (i.e., the expected long-term accumulated reward) of a policy. With a good policy evaluation method, an RL algorithm can estimate the value function more accurately and thus find a better policy. When the state space is large or continuous, \emph{Gradient-based Temporal Difference (GTD)} policy evaluation algorithms with linear function approximation are widely used. Considering that collecting evaluation data is costly in both time and reward, a clear understanding of the finite sample performance of policy evaluation algorithms is very important to reinforcement learning. Under the assumption that data are i.i.d.
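For context, the update rules the GTD family uses with linear function approximation can be sketched as follows. This is the standard GTD2 form from the literature (e.g., Sutton et al., 2009), written in conventional symbols rather than notation taken from this abstract: $\theta_t$ are the value-function weights, $w_t$ an auxiliary weight vector, $\phi_t = \phi(s_t)$ the feature vector of the current state, and $\alpha_t, \beta_t$ step sizes.
$$
\delta_t = r_t + \gamma\,\theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t, \qquad
\theta_{t+1} = \theta_t + \alpha_t\,(\phi_t - \gamma\,\phi_{t+1})\,(\phi_t^\top w_t), \qquad
w_{t+1} = w_t + \beta_t\,(\delta_t - \phi_t^\top w_t)\,\phi_t.
$$
The auxiliary iterate $w_t$ tracks a projection of the TD error onto the feature space, making GTD a two-timescale stochastic approximation; this coupling is part of what makes its finite sample analysis nontrivial, particularly under Markovian rather than i.i.d. sampling.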