Reviews: Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs

Neural Information Processing Systems 

This paper provides with a theoretical analysis of optimistic algorithms in the setting of episodic MDPs. The authors provide with a problem-dependent and gap-dependent log(T) regret bound which is a finite-time bound, that unlike previous bounds does not take into account other problem properties such as diameter or mixing times. The paper includes several theoretical results, and introduces a novel'clipping' trick used to prove the bounds. I found the paper interesting, however, it is constructed in a way which is hard to follow and not very readable. First, many concepts are referred to before they are defined (for instance, the notion of optimistic algorithms).