Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy
Lee, Kyungbok, Paik, Myunghee Cho
In many decision-making problems, estimating the value of a policy, that is, its expected reward, is a central task. Online evaluation, which requires evaluating each policy directly, can be expensive and may not be practical when there are multiple target policies. Off-policy evaluation (OPE) is an alternative: it estimates the value of a target policy from log data generated by a different logging policy. This approach has attracted considerable interest in contextual bandits (CB) [Dudík et al., 2011, Swaminathan et al., 2017] and reinforcement learning (RL) [Precup, 2000, Mahmood et al., 2014, Jiang and Li, 2016]. Several off-policy evaluation algorithms currently in use [Dudík et al., 2011, Thomas and Brunskill, 2016, Wang et al., 2017, Farajtabar et al., 2018, Su et al., 2020] rely on complete knowledge of the logging policy, because they use inverse probability weighting (IPW), which reweights each logged reward by the probabilities that the target and logging policies assign to the logged action.
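To make the dependence on the logging policy concrete, the following is a minimal sketch of the standard IPW estimator in the contextual bandit setting, assuming the logging probabilities are known exactly. The function name `ipw_value` and its arguments are illustrative choices, not notation from this paper: `logging_probs[i]` is the probability the logging policy assigned to the action actually logged in round i, and `target_probs[i]` is the probability the target policy assigns to that same action in the same context.

```python
from typing import Sequence


def ipw_value(
    rewards: Sequence[float],
    logging_probs: Sequence[float],
    target_probs: Sequence[float],
) -> float:
    """Inverse probability weighting estimate of a target policy's value.

    Hypothetical helper for illustration. Each logged reward is reweighted
    by target_prob / logging_prob for the logged action, then averaged.
    """
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one logged round")
    if not (len(logging_probs) == n and len(target_probs) == n):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for r, mu, pi in zip(rewards, logging_probs, target_probs):
        # The weight is undefined if the logging policy could never
        # have taken the logged action.
        if mu <= 0.0:
            raise ValueError("logging probabilities must be positive")
        total += (pi / mu) * r
    return total / n


# Three logged rounds with importance weights 1.0, 0.5 and 2.0.
estimate = ipw_value(
    rewards=[1.0, 0.0, 1.0],
    logging_probs=[0.5, 0.5, 0.25],
    target_probs=[0.5, 0.25, 0.5],
)
print(estimate)  # 1.0
```

Every weight divides by a logging probability, so the estimator cannot be computed as written when the logging policy is unknown.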