Reviews: Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

Neural Information Processing Systems 

This paper presents new likelihood-based estimators for off-policy evaluation (OPE) and argues that they are better than importance sampling (IS). It provides strong theoretical guarantees for the estimators and demonstrates them through simple experiments. The reviewers agree that the paper is well written overall and that the proposed methods are technically sound and likely to be built upon by the community. One reviewer is unsure whether the proposed methods will be practical in RL applications. The experiments are performed on very simple tasks.