Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency

Uehara, Masatoshi, Imaizumi, Masaaki, Jiang, Nan, Kallus, Nathan, Sun, Wen, Xie, Tengyang

Feb-4-2021–arXiv.org Machine Learning

Off-policy evaluation (OPE) is the problem of estimating the expected return of a given decision policy, known as the evaluation policy, in an unknown Markov decision process (MDP), using transition data generated by another policy, known as the behavior policy (Bibaut et al., 2019; Precup et al., 2000; Thomas et al., 2015). OPE is especially important in applications where experimentation is costly, such as medicine. Recently, the first-order efficiency bound for OPE, that is, the smallest possible coefficient C of the leading 1/√n term in the estimation error C/√n + o(1/√n), was derived by Kallus and Uehara (2020) for time-varying MDPs and by Kallus and Uehara (2019) for time-homogeneous MDPs (which we focus on and simply call MDPs). In the time-varying tabular setting, the bound coincides with that of Jiang and Li (2016), and Yin and Wang (2020) showed that the model-based estimator achieves it. However, whether the lower bound is achievable in general settings is unclear. Many approaches to OPE rely on estimating the q-function (representing long-term value) or the w-function (representing density ratios) under the so-called realizability (a.k.a. well-specification) and/or completeness (a.k.a. closedness of the hypothesis class under Bellman operators; Antos et al., 2008; Chen and Jiang, 2019) assumptions. For example, the q-function can be estimated via Fitted-Q Iteration (FQI; Ernst et al., 2005), and the w-function is central to recent methods based on the idea of marginalized importance sampling (Gelada and Bellemare, 2019; Liu et al., 2018). In this paper, we study minimax estimators of the q- and w-functions and their implications for OPE.
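To make the roles of the q-function and the w-function concrete, the following is a minimal sketch on a small tabular MDP whose dynamics are assumed known, which is not the case in the actual OPE setting. It is not the minimax estimator studied in the paper. The function name, the array layout, and the normalization of w (normalized discounted occupancy of the evaluation policy divided by the data distribution) are assumptions chosen for illustration. The sketch computes the policy value once from q and once from w and checks that the two agree.

```python
import numpy as np


def policy_value_via_q_and_w(P, r, pi_e, d_b, d0, gamma):
    """Exact policy value in a tabular MDP, computed two ways.

    P    : (S, A, S) transition probabilities P[s, a, s']
    r    : (S, A)    expected rewards
    pi_e : (S, A)    evaluation policy pi_e[s, a]
    d_b  : (S, A)    data (behavior) distribution over state-action pairs
    d0   : (S,)      initial state distribution
    """
    S, A = r.shape
    n = S * A
    # State-action transition matrix under pi_e:
    # T[(s, a), (s', a')] = P[s, a, s'] * pi_e[s', a']
    T = (P[:, :, :, None] * pi_e[None, None, :, :]).reshape(n, n)
    I = np.eye(n)
    r_flat = r.reshape(n)
    db_flat = d_b.reshape(n)
    # Initial state-action distribution under pi_e.
    mu0 = (d0[:, None] * pi_e).reshape(n)

    # q-function: solves the Bellman equation q = r + gamma * T q.
    q = np.linalg.solve(I - gamma * T, r_flat)
    J_q = mu0 @ q

    # Normalized discounted occupancy: d = (1 - gamma) * mu0 + gamma * T^T d.
    d_pi = np.linalg.solve(I - gamma * T.T, (1 - gamma) * mu0)
    # w-function: density ratio between occupancy and data distribution.
    w = d_pi / db_flat
    # Value as a reweighted average of rewards under the data distribution.
    J_w = np.sum(db_flat * w * r_flat) / (1 - gamma)
    return J_q, J_w


# Hypothetical random instance, for illustration only.
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))          # shape (S, A, S)
r = rng.uniform(size=(S, A))
pi_e = rng.dirichlet(np.ones(A), size=S)            # shape (S, A)
d_b = rng.dirichlet(np.ones(S * A)).reshape(S, A)   # full-support data distribution
d0 = np.full(S, 1.0 / S)

J_q, J_w = policy_value_via_q_and_w(P, r, pi_e, d_b, d0, gamma)
print(J_q, J_w)
```

In practice neither q nor w is available in closed form and both must be estimated from the transition data, which is where realizability and completeness assumptions on the function classes come in.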



Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Reinforcement Learning (0.82)
  - Representation & Reasoning > Search (0.61)
