Uniform Last-Iterate Guarantee for Bandits and Reinforcement Learning

Neural Information Processing Systems 

Existing metrics for reinforcement learning (RL), such as regret, PAC bounds, or uniform-PAC [Dann et al., 2017], typically evaluate cumulative performance while allowing the agent to play an arbitrarily bad policy at any finite time t. Such behavior can be highly detrimental in high-stakes applications. This paper introduces a stronger metric, the uniform last-iterate (ULI) guarantee, which captures both the cumulative and the instantaneous performance of RL algorithms. Specifically, ULI characterizes instantaneous performance by ensuring that the per-round suboptimality of the played policy is bounded by a function that is monotonically decreasing w.r.t.