Review for NeurIPS paper: The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning

Neural Information Processing Systems 

Weaknesses: – Sec 3: Your method strongly depends on the 'top-terminal fraction'. I see multiple potential problems: 1) What worries me most is that it only measures optimality. What if my model-based agent adapts very fast to the new domain but reaches just below optimal performance? Then my MBRL method can be very effective, but the LoCA regret will still be very large. Note that the regret at the bottom of P4 cannot correct for this, as it sums over all timesteps and multiplies by the success fraction. 3) In more complicated tasks, it can be hard to determine the optimal behaviour, i.e., even to define the 'top-terminal fraction'.
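To make concern 1) concrete, here is a minimal numerical sketch. It does not use the paper's exact regret definition; it simply accumulates the per-timestep shortfall from the optimal per-step reward, which captures the qualitative behaviour the review points at. All agents, horizons, and reward values below are hypothetical.

```python
def cumulative_regret(rewards, optimal_reward):
    # Sum of the per-timestep shortfall from the optimal per-step reward.
    return sum(optimal_reward - r for r in rewards)

T = 50_000          # evaluation horizon (hypothetical)
optimal = 1.0       # optimal per-step reward (hypothetical)

# Agent A: adapts to the new domain immediately, but plateaus at 95% of optimal.
fast_suboptimal = [0.95 * optimal] * T

# Agent B: needs 1000 steps to adapt, then behaves exactly optimally.
slow_optimal = [0.0] * 1000 + [optimal] * (T - 1000)

print(round(cumulative_regret(fast_suboptimal, optimal), 2))  # 2500.0
print(round(cumulative_regret(slow_optimal, optimal), 2))     # 1000.0
```

Because the fast-but-slightly-suboptimal agent's regret grows linearly with the horizon, it ends up with a larger regret than the slow-but-optimal agent, even though its adaptation behaviour is exactly what a model-based metric should reward.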