Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency

Uehara, Masatoshi, Imaizumi, Masaaki, Jiang, Nan, Kallus, Nathan, Sun, Wen, Xie, Tengyang

arXiv.org Machine Learning 

Off-policy evaluation (OPE) is the problem of estimating the expected return of a given decision policy, known as the evaluation policy, in an unknown Markov decision process (MDP), using transition data generated by another policy, known as the behavior policy (Bibaut et al., 2019; Precup et al., 2000; Thomas et al., 2015). OPE is especially important in applications where experimentation is costly, such as medicine. Recently, the first-order efficiency bound for OPE was derived by Kallus and Uehara (2020) for time-varying MDPs and by Kallus and Uehara (2019) for time-homogeneous MDPs (which we focus on and simply call MDPs). This bound is the smallest possible coefficient C of the leading 1/√n term in the estimation error C/√n + o(1/√n). In the time-varying tabular setting, the bound coincides with that of Jiang and Li (2016), and Yin and Wang (2020) showed that the model-based estimator achieves it. However, whether the lower bound is achievable in general settings is unclear. Many approaches to OPE rely on estimating the q-function (representing long-term value) or the w-function (representing density ratios) under the so-called realizability (a.k.a. well-specification) and/or completeness (a.k.a. hypothesis class closed under Bellman operators; Antos et al., 2008; Chen and Jiang, 2019) assumptions. For example, the q-function can be estimated via Fitted-Q Iteration (FQI; Ernst et al., 2005), and the w-function is central to recent methods based on the idea of marginalized importance sampling (Gelada and Bellemare, 2019; Liu et al., 2018). In this paper, we study minimax estimators of the q- and w-functions and their implications for OPE.
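To make the roles of the q-function and the w-function concrete, the following is a minimal population-level sketch on a toy tabular MDP. It is not the minimax estimator studied in the paper. All numbers (the transition table `P`, rewards `R`, initial distribution `D0`, evaluation policy `PI_E`, and the uniform data distribution `MU`) are hypothetical values chosen for illustration, and the model is assumed known, which is not the case in actual OPE. The sketch only checks that averaging the q-function over initial states and reweighting rewards by the density ratio w give the same discounted return.

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)

# Hypothetical toy model: P[s][a][s2] are transition probabilities,
# R[s][a] are expected rewards.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
D0 = [0.5, 0.5]                     # initial state distribution (assumed)
PI_E = [[0.9, 0.1], [0.2, 0.8]]     # evaluation policy pi_e(a|s) (assumed)
MU = [[0.25, 0.25], [0.25, 0.25]]   # data distribution over (s, a) (assumed)


def q_function(iters=2000):
    """Iterate the Bellman equation for pi_e until numerical convergence."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(iters):
        # State values under pi_e implied by the current q.
        v = [sum(PI_E[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES)
              for a in ACTIONS] for s in STATES]
    return q


def occupancy(iters=2000):
    """Normalized discounted state-action occupancy of pi_e."""
    state_dist = D0[:]
    d = [[0.0, 0.0], [0.0, 0.0]]
    weight = 1.0 - GAMMA
    for _ in range(iters):
        for s in STATES:
            for a in ACTIONS:
                d[s][a] += weight * state_dist[s] * PI_E[s][a]
        # Push the state distribution one step forward under pi_e.
        state_dist = [sum(state_dist[s] * PI_E[s][a] * P[s][a][s2]
                          for s in STATES for a in ACTIONS) for s2 in STATES]
        weight *= GAMMA
    return d


q = q_function()
d_e = occupancy()
# w-function: ratio of pi_e's discounted occupancy to the data distribution.
w = [[d_e[s][a] / MU[s][a] for a in ACTIONS] for s in STATES]

# Route 1: average q over the initial state distribution and pi_e.
value_from_q = sum(D0[s] * PI_E[s][a] * q[s][a]
                   for s in STATES for a in ACTIONS)
# Route 2: reweight rewards under the data distribution by w.
value_from_w = sum(MU[s][a] * w[s][a] * R[s][a]
                   for s in STATES for a in ACTIONS) / (1.0 - GAMMA)

print(value_from_q, value_from_w)
```

In practice neither `q` nor `w` is available in closed form and both must be estimated from transition data, which is where the realizability and completeness assumptions on the function classes enter.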
