Uniform Last-Iterate Guarantee for Bandits and Reinforcement Learning

Mar-22-2026, 20:43:34 GMT–Neural Information Processing Systems

Existing metrics for reinforcement learning (RL) such as regret, PAC bounds, or uniform-PAC (Dann et al., 2017), typically evaluate the cumulative performance, while allowing the play of an arbitrarily bad policy at any finite time t. Such a behavior can be highly detrimental in high-stakes applications. This paper introduces a stronger metric, uniform last-iterate (ULI) guarantee, capturing both cumulative and instantaneous performance of RL algorithms. Specifically, ULI characterizes the instantaneous performance since it ensures that the per-round suboptimality of the played policy is bounded by a function, monotonically decreasing w.r.t.

artificial intelligence, machine learning, reinforcement learning, (9 more...)

Neural Information Processing Systems

Mar-22-2026, 20:43:34 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.65)