Review for NeurIPS paper: Planning in Markov Decision Processes with Gap-Dependent Sample Complexity

Neural Information Processing Systems 

Additional Feedback: Post-rebuttal: The authors addressed some of my concerns. Since the authors will redesign some of the experiments in the revision, I will raise my score to 6.

Comments and questions:

1. Are there any lower bound results on the sample complexity of planning? Are there any particular reasons, and what is the high-level idea of this algorithm? If I understand correctly, this rule is there to obtain the gap-dependent sample complexity. What if we use the simple greedy policy for the first action? What would go wrong in the proof?