Reviews: Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Neural Information Processing Systems 

The paper studies the important problem of off-policy policy evaluation in long-horizon MDPs. The setting focuses on small-state, large-action problems. A novel estimator is proposed, and its finite-sample statistical properties are analyzed. Empirical results show the method is useful, especially in partially observable problems. The reviewers feel the experimental section could be strengthened (e.g., by evaluating on more domains).