Characterization of Efficient Influence Function for Off-Policy Evaluation Under Optimal Policies

Wei, Haoyu

arXiv.org Machine Learning 

Reinforcement learning (RL), which focuses on developing optimal policies for sequential decision-making to maximize long-term rewards (Sutton & Barto, 2018), has become an increasingly important frontier in a variety of fields. A critical component of RL is off-policy evaluation (OPE), which estimates the mean reward of a policy, termed the evaluation policy, using data collected under another policy, known as the behavior policy. OPE is essential in offline RL, where only historical datasets are available and new experiments are precluded (Luedtke & van der Laan, 2016; Agarwal et al., 2019; Uehara et al., 2022). Recent years have witnessed substantial progress in developing statistically efficient OPE methods, with various approaches achieving semiparametric efficiency under different model settings (Jiang & Li, 2016; Kallus & Uehara, 2020; Shi et al., 2021). However, all of these existing analyses focus on scenarios where the evaluation policy is fixed and predetermined.

A more challenging yet practical scenario arises when the evaluation policy itself is estimated from data, particularly when it is designed to be optimal with respect to some criterion. In this setting, the statistical properties of OPE become more delicate because of the additional estimation uncertainty introduced by the policy optimization step. In the causal inference literature, by contrast, such phenomena have been studied extensively in work on optimal treatment regimes (Laber et al., 2014; Kosorok & Laber, 2019; Athey & Wager, 2021). These works establish important results on estimating value functions under optimal treatment rules, but applying them directly to the sequential decision-making setting of RL raises additional challenges due to the temporal dependencies and potentially infinite horizons involved.
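To fix ideas, the following is a minimal sketch of the OPE estimand and a basic estimator; the notation ($\pi_e$, $\pi_b$, discount factor $\gamma$, trajectories indexed by $i$) is illustrative and not taken from this paper. The target is the discounted value of the evaluation policy,
\[
V(\pi_e) \;=\; \mathbb{E}_{\pi_e}\!\Big[\sum_{t=0}^{\infty} \gamma^{t} R_t\Big], \qquad \gamma \in (0,1),
\]
which must be estimated from trajectories generated under the behavior policy $\pi_b$. A simple, generally inefficient, importance-sampling estimator over a finite horizon $T$ reweights each observed trajectory by its cumulative density ratio,
\[
\widehat{V}_{\mathrm{IS}}(\pi_e) \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big(\prod_{t=0}^{T} \frac{\pi_e(A_t^{(i)} \mid S_t^{(i)})}{\pi_b(A_t^{(i)} \mid S_t^{(i)})}\Big)\sum_{t=0}^{T} \gamma^{t} R_t^{(i)}.
\]
Semiparametrically efficient approaches augment such weighting with estimated value or Q-functions; the complication considered in this work is that $\pi_e$ is not fixed in advance but is itself estimated to be (approximately) optimal.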