Near Optimal Provable Uniform Convergence in Off-Policy Evaluation for Reinforcement Learning

Ming Yin, Yu Bai, Yu-Xiang Wang

arXiv.org Artificial Intelligence 

Reinforcement learning (RL) is the problem of an agent interacting with an environment in order to maximize its cumulative reward over time Sutton & Barto (2018). Within the broad landscape of reinforcement learning, off-policy evaluation (OPE) refers to the problem of predicting the performance of a policy using only data collected by a logging (behavioral) policy, and is of crucial importance to real-world applications of RL including marketing Thomas et al. (2017), targeted advertising Bottou et al. (2013); Tang et al. (2013), finance Bertoluzzo & Corazza (2012), robotics Quillen et al. (2018), and healthcare Ernst et al. (2006); Raghu et al. (2017, 2018). A central challenge in OPE is the distributional mismatch between the behavioral policy and the target policy, which previous studies have tackled using Importance Sampling (IS) based methods Li et al. (2011); Dudík et al. (2011); Li et al. (2015); Thomas & Brunskill (2016) or hybrid versions such as doubly robust estimators Jiang & Li (2016); Farajtabar et al. (2018). More recently, a family of estimators based on marginalized importance sampling (MIS) Liu et al. (2018); Xie et al. (2019); Kallus & Uehara (2019a,b); Yin & Wang (2020) has been proposed to overcome the "curse of horizon": for some classes of MDPs, any unbiased estimator must incur variance that is exponential in the horizon Jiang & Li (2016); Liu et al. (2018). However, previous works only consider the OPE problem for a fixed (non-data-dependent) target policy π, whereas in practice we commonly need to evaluate the performance of a data-dependent one.
