Reinforcement Learning under Model Mismatch
Roy, Aurko, Xu, Huan, Pokutta, Sebastian
Reinforcement learning is concerned with learning a good policy for sequential decision making problems modeled as a Markov Decision Process (MDP), via interacting with the environment [22, 20]. In this work we address the problem of reinforcement learning from a misspecified model. As a motivating example, consider the scenario where the problem of interest is not directly accessible, but instead the agent can interact with a simulator whose dynamics is reasonably close to the true problem. Another plausible application is when the parameters of the model may evolve over time but can still be reasonably approximated by an MDP. To address this problem we use the framework of robust MDPs which was proposed by [2, 17, 13] to solve the planning problem under model misspecification. The robust MDP framework considers a class of models and finds the robust optimal policy which is a policy that performs best under the worst model. It was shown by [2, 17, 13] that the robust optimal policy satisfies the robust Bellman equation which naturally leads to exact dynamic programming algorithms to find an optimal policy. However, this approach is model dependent and does not immediately generalize to the model-free case where the parameters of the model are unknown. Essentially, reinforcement learning is a model-free framework to solve the Bellman equation using samples.
Nov-8-2017