Reviews: Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs

Neural Information Processing Systems 

The paper contributes useful structural results on regret minimization for Markov decision processes (MDPs) in reinforcement learning, specifically for the class of tabular (i.e., unstructured) finite-horizon episodic MDPs. Its new theoretical techniques are likely to stimulate further finite-sample analysis of online learning in MDPs.