Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning

Neural Information Processing Systems 

The environment and an agent's interactions are typically modeled as a Markov