Review for NeurIPS paper: Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting


Weaknesses: I have some concerns and questions:

- To obtain an efficiently implementable algorithm, the authors have DORL construct an optimistic MDP via a very simple procedure. This construction considers only the error bounds and completely ignores the value function. So, while the proof guarantees optimism, I believe the resulting optimistic MDP is overly optimistic: in favoring computational efficiency, this approach may sacrifice learning efficiency to a large extent. The idea of optimizing over a finite set of MDPs (in lieu of the bounded-parameter MDP) is indeed nice. However, I believe the current construction, which completely ignores the value function, is too naive to work in practice.
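To make this concern concrete, here is a minimal toy sketch (this is my own caricature of the trade-off, not the paper's actual DORL construction; the MDP, the `eps` radius, and all names are illustrative). It contrasts a value-ignoring optimism construction, which guarantees optimism by adding the worst-case bonus eps*Rmax/(1-gamma) derived purely from the error bound, with a value-aware construction in the extended-value-iteration style that spends the same confidence budget by shifting transition mass toward the highest-value state:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=600):
    """Tabular value iteration. P: (A, S, S) transitions, R: (S,) rewards."""
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        V = R + gamma * np.max(P @ V, axis=0)
    return V

# Toy MDP (illustrative, not from the paper): 3 states, 1 action,
# uniform transitions, reward only in state 2. Pretend the empirical
# model equals the truth and the L1 confidence radius is eps.
S, A, gamma, eps, Rmax = 3, 1, 0.9, 0.2, 1.0
P_hat = np.full((A, S, S), 1.0 / S)
R = np.array([0.0, 0.0, 1.0])
V_true = value_iteration(P_hat, R, gamma)

# Value-ignoring optimism: add the worst-case bonus eps * Rmax / (1 - gamma)
# to every reward. This guarantees optimism over the L1 confidence ball
# using only the error bound, with no reference to the value function.
V_naive = value_iteration(P_hat, R + eps * Rmax / (1 - gamma), gamma)

# Value-aware optimism (extended-value-iteration style): spend the same
# eps budget by shifting transition mass toward the highest-value state.
best = int(np.argmax(V_true))
P_opt = P_hat * (1 - eps)   # remove eps mass proportionally from each row
P_opt[..., best] += eps     # re-deposit it on the best state
V_aware = value_iteration(P_opt, R, gamma)
```

Both constructions are optimistic here (`V_true` is about `[3, 3, 4]`, `V_aware` about `[4.2, 4.2, 5.2]`), but the value-ignoring bonus inflates every state's value by eps*Rmax/(1-gamma) = 20, an order of magnitude looser. A much-looser optimistic model typically translates into excess exploration and worse regret, which is exactly the learning-efficiency cost the comment above is worried about.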