Uniform Last-Iterate Guarantee for Bandits and Reinforcement Learning

Neural Information Processing Systems 

Existing metrics for reinforcement learning (RL), such as regret, PAC bounds, or uniform-PAC [Dann et al., 2017], typically evaluate cumulative performance while allowing the agent to play an arbitrarily bad policy at any finite time t. Such behavior can be highly detrimental in high-stakes applications. This paper introduces a stronger metric, the uniform last-iterate (ULI) guarantee, which captures both the cumulative and the instantaneous performance of RL algorithms. Specifically, ULI characterizes instantaneous performance by ensuring that the per-round suboptimality of the played policy is bounded by a function that is monotonically decreasing w.r.t.