Uniform Last-Iterate Guarantee for Bandits and Reinforcement Learning

Neural Information Processing Systems 

Existing metrics for reinforcement learning (RL) such as regret, PAC bounds, or uniform-PAC (Dann et al., 2017), typically evaluate the cumulative performance, while allowing the play of an arbitrarily bad policy at any finite time t. Such a behavior can be highly detrimental in high-stakes applications. This paper introduces a stronger metric, uniform last-iterate (ULI) guarantee, capturing both cumulative and instantaneous performance of RL algorithms. Specifically, ULI characterizes the instantaneous performance since it ensures that the per-round suboptimality of the played policy is bounded by a function, monotonically decreasing w.r.t.