Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency
Uehara, Masatoshi; Imaizumi, Masaaki; Jiang, Nan; Kallus, Nathan; Sun, Wen; Xie, Tengyang
Off-policy evaluation (OPE) is the problem of estimating the expected return of a given decision policy, known as the evaluation policy, in an unknown Markov decision process (MDP), using transition data generated by another policy, known as the behavior policy (Bibaut et al., 2019; Precup et al., 2000; Thomas et al., 2015). OPE is especially important in applications where experimentation is costly, such as medicine. Recently, the first-order efficiency bound for OPE was derived by Kallus and Uehara (2020) for time-varying MDPs and by Kallus and Uehara (2019) for time-homogeneous MDPs (which we focus on and simply call MDPs). This bound is the smallest possible coefficient $C$ of the leading $1/n$ term in the estimation error $C/n + o(1/n)$. In the time-varying tabular setting, the bound coincides with that of Jiang and Li (2016), and Yin and Wang (2020) showed that the model-based estimator achieves it. However, whether the lower bound is achievable in general settings is unclear. Many approaches to OPE rely on estimating the q-function (representing long-term value) or the w-function (representing density ratios), under the so-called realizability (a.k.a. well-specification) and/or completeness (a.k.a. closedness of the hypothesis class under Bellman operators; Antos et al., 2008; Chen and Jiang, 2019) assumptions. For example, the q-function can be estimated via Fitted-Q Iteration (FQI; Ernst et al., 2005), and the w-function is central to recent methods based on marginalized importance sampling (Gelada and Bellemare, 2019; Liu et al., 2018). In this paper, we study minimax estimators of the q- and w-functions and their implications for OPE.
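To make the roles of the q-function and the w-function concrete, the following is a minimal sketch on a hypothetical two-state, two-action discounted MDP. It is not the minimax estimator studied in the paper: it assumes the transition model is known and computes both functions exactly, only to illustrate that the q-based and w-based expressions give the same policy value. The discount factor, all numerical values, the normalization by (1 - gamma), and every variable name are assumptions made for illustration.

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 2, 2

# Hypothetical toy MDP; every number below is made up for illustration.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])    # P[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])      # r[s, a]
d0 = np.array([0.6, 0.4])                   # initial state distribution
pi_e = np.array([[0.9, 0.1], [0.2, 0.8]])   # evaluation policy pi_e[s, a]
pi_b = np.array([[0.5, 0.5], [0.5, 0.5]])   # behavior policy pi_b[s, a]
state_dist_b = np.array([0.5, 0.5])         # assumed state marginal of the offline data

n_sa = n_states * n_actions
# Transition matrix over (s, a) pairs under pi_e: (s, a) -> (s', a').
P_pi = np.einsum("sat,tb->satb", P, pi_e).reshape(n_sa, n_sa)
bellman = np.eye(n_sa) - gamma * P_pi

# q-function: solves the Bellman equation q = r + gamma * P_pi q.
q = np.linalg.solve(bellman, r.reshape(-1))

# Initial (s, a) distribution when the first action is drawn from pi_e.
init = (d0[:, None] * pi_e).reshape(-1)
value_from_q = (1 - gamma) * init @ q

# Normalized discounted occupancy of pi_e over (s, a) pairs.
d_e = (1 - gamma) * np.linalg.solve(bellman.T, init)

# w-function: density ratio between d_e and the offline data distribution.
mu = (state_dist_b[:, None] * pi_b).reshape(-1)
w = d_e / mu
value_from_w = np.sum(mu * w * r.reshape(-1))

print(value_from_q, value_from_w)
```

In an offline setting neither `P` nor `mu` is available, so `q` and `w` would have to be estimated from sampled transitions, which is where function classes and assumptions such as realizability and completeness enter.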