Operator Augmentation for Model-based Policy Evaluation

Xun Tang, Lexing Ying, Yuhua Zhu

arXiv.org Machine Learning 

Reinforcement learning (RL) has received much attention following recent successes such as AlphaGo and AlphaZero [25, 26]. One of the fundamental problems of RL is policy evaluation [29]. When the transition dynamics are unknown, model-based RL learns a dynamics model from observed data. However, even if the learned model is an unbiased estimate of the true dynamics, policy evaluation under the learned model is biased, because the value function depends nonlinearly on the transition dynamics. The question of interest in this paper is whether one can increase the accuracy of policy evaluation given an estimated dynamics model. We consider a discounted Markov decision process (MDP) M = (S, A, P, r, γ) with discrete state space S and discrete action space A, where P is the transition kernel, r is the reward, and γ ∈ (0, 1) is the discount factor. We write S = |S| and A = |A| for the sizes of the state and action spaces, respectively.
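The bias mentioned above can be seen in a small numerical experiment. Below is a minimal sketch (not code from the paper; the MDP instance, sample sizes, and helper name `policy_value` are hypothetical) that evaluates a fixed policy exactly via v = (I − γ P_π)^{-1} r_π, then repeats the evaluation with empirical transition estimates that are unbiased for P_π. Averaging the plug-in values over many trials shows a systematic gap from the true value, since the matrix inverse is a nonlinear function of the transition probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_value(P_pi, r_pi, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Small synthetic MDP under a fixed policy (hypothetical instance for illustration).
n_states = 5
gamma = 0.9
P_pi = rng.dirichlet(np.ones(n_states), size=n_states)  # true row-stochastic transitions
r_pi = rng.uniform(size=n_states)                        # per-state rewards under the policy

v_true = policy_value(P_pi, r_pi, gamma)

# Estimate each row of P_pi from a finite number of observed transitions.
# The empirical model P_hat is unbiased: E[P_hat] = P_pi.
n_samples = 20
n_trials = 10_000
v_hat_mean = np.zeros(n_states)
for _ in range(n_trials):
    counts = np.stack([rng.multinomial(n_samples, P_pi[s]) for s in range(n_states)])
    P_hat = counts / n_samples
    v_hat_mean += policy_value(P_hat, r_pi, gamma)
v_hat_mean /= n_trials

# The plug-in value estimate is nevertheless biased, because v depends on the
# transitions nonlinearly through (I - gamma * P_pi)^{-1}.
print("bias per state:", v_hat_mean - v_true)
```

Running the sketch shows a nonzero average gap between the plug-in estimates and v_true, which shrinks as n_samples grows; this is the phenomenon the paper seeks to correct given only the estimated model.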