Review for NeurIPS paper: Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation

Neural Information Processing Systems 

Weaknesses: The empirical evaluation omits relevant baselines, which makes it hard to assess the usefulness of ESRL relative to prior approaches. The main algorithm (Algo 1) uses majority voting and hypothesis testing in addition to learning multiple Q-estimates from K sampled MDPs. Furthermore, based on the figure captions, K seems to be large (250 for Riverswim, 500 for Sepsis), so it seems unfair to compare against a single DQN model. A *naive* baseline would be to take the ensemble of these K Q-estimates and simply use their mean to select actions: this would *quantify* the empirical benefit of hypothesis testing. This baseline should be discussed in the paper and compared against empirically, as it is a simple way to incorporate value uncertainty in offline RL. 3. As mentioned in the paper, ESRL can either deviate from the behavior policy when required or stick to it, depending on the outcome of the hypothesis test.
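
To make the suggested naive baseline concrete, here is a minimal sketch, assuming a tabular setting in which the K Q-estimates are stacked into an array of shape (K, n_states, n_actions). The function names and the array layout are my own assumptions for illustration, not the paper's code, and the majority-vote function is only a simplified stand-in for that one ingredient of Algo 1 (it omits the hypothesis test entirely).

```python
import numpy as np


def ensemble_mean_policy(q_estimates):
    """Naive baseline: act greedily on the mean of the K Q-estimates.

    q_estimates: array-like of shape (K, n_states, n_actions) (assumed layout).
    Returns an integer array of shape (n_states,) with one action per state.
    """
    q = np.asarray(q_estimates, dtype=float)
    if q.ndim != 3:
        raise ValueError("expected shape (K, n_states, n_actions)")
    mean_q = q.mean(axis=0)        # average over the K ensemble members
    return mean_q.argmax(axis=1)   # greedy action per state


def majority_vote_policy(q_estimates):
    """Simplified majority vote over the K greedy actions (no hypothesis test).

    Ties are broken in favour of the lowest action index.
    """
    q = np.asarray(q_estimates, dtype=float)
    if q.ndim != 3:
        raise ValueError("expected shape (K, n_states, n_actions)")
    greedy = q.argmax(axis=2)      # shape (K, n_states): each member's choice
    n_actions = q.shape[2]
    # votes[s, a] = number of ensemble members whose greedy action in s is a
    votes = np.stack([(greedy == a).sum(axis=0) for a in range(n_actions)], axis=1)
    return votes.argmax(axis=1)


if __name__ == "__main__":
    # Toy example: K=3 members, 1 state, 2 actions. Two members mildly prefer
    # action 0, one member strongly prefers action 1.
    q = np.array([[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 5.0]]])
    print("ensemble mean:", ensemble_mean_policy(q))
    print("majority vote:", majority_vote_policy(q))
```

In the toy example the two rules disagree, because the mean is sensitive to the magnitude of one member's estimate while the vote is not. Reporting the mean-ensemble policy next to ESRL on Riverswim and Sepsis would isolate how much of the gain comes from the hypothesis test rather than from simply having K estimates.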