Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Xie, Tengyang; Ma, Yifei; Wang, Yu-Xiang

Mar-19-2020, 00:31:39 GMT–Neural Information Processing Systems

Motivated by the many real-world applications of reinforcement learning (RL) that require safe policy iterations, we consider the problem of off-policy evaluation (OPE), that is, evaluating a new policy using historical data obtained by different behavior policies, under the model of nonstationary episodic Markov decision processes (MDPs) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. The resulting error bound matches the Cramér-Rao lower bound in [Jiang and Li, 2016] up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$.
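The recursion described in the abstract can be illustrated with a minimal tabular sketch. It is not the authors' implementation. It assumes a small discrete state and action space, a known behavior policy, and, for brevity, policies that do not change with the time step (the paper treats the nonstationary case). The function name `mis_estimate` and all array names and shapes are hypothetical choices made for this illustration.

```python
import numpy as np

def mis_estimate(states, actions, rewards, pi, mu, num_states):
    """Tabular marginalized importance sampling sketch (illustrative only).

    states, actions: int arrays of shape (n, H), one row per logged episode.
    rewards: float array of shape (n, H).
    pi, mu: arrays of shape (S, A) holding target and behavior action
        probabilities (assumed known and time-independent here).
    """
    n, H = states.shape
    # Per-step importance ratio pi(a|s) / mu(a|s); no product over the horizon.
    rho = pi[states, actions] / mu[states, actions]
    # Estimated state marginal under pi at step 0: empirical initial distribution.
    d = np.bincount(states[:, 0], minlength=num_states) / n
    value = 0.0
    for t in range(H):
        s_t = states[:, t]
        counts = np.bincount(s_t, minlength=num_states)
        safe = np.maximum(counts, 1)  # avoid dividing by zero for unvisited states
        # Importance-weighted mean reward for each state at step t.
        r_hat = np.bincount(s_t, weights=rho[:, t] * rewards[:, t],
                            minlength=num_states) / safe
        value += d @ r_hat
        if t + 1 < H:
            # Importance-weighted estimate of the state transition under pi.
            P = np.zeros((num_states, num_states))
            np.add.at(P, (s_t, states[:, t + 1]), rho[:, t])
            P /= safe[:, None]
            # Recursive step: push the marginal forward by one time step.
            d = d @ P
    return value
```

Each step uses only a single-step ratio, whereas trajectory-wise importance sampling multiplies $H$ ratios together, which is the source of the exponential dependence on $H$ mentioned above. In this sketch, when `pi` equals `mu` every ratio is 1 and the estimate reduces to the average observed return.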

marginalized importance sampling, optimal off-policy evaluation, reinforcement learning

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.63)