Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
Neural Information Processing Systems
While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms exist for RL with large state-action spaces when the rewards are \emph{heavy-tailed}, i.e., admit only finite $(1+\epsilon)$-th moments for some $\epsilon \in (0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show that the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation.
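To make the heavy-tailed assumption concrete, the following sketch (an illustration, not part of the paper) draws rewards from a Pareto distribution with shape parameter $\alpha = 1.5$: such rewards have infinite variance, yet their $(1+\epsilon)$-th central moment is finite for any $\epsilon < \alpha - 1$, which is exactly the regime $\epsilon \in (0,1]$ the abstract describes. The variable names (`alpha`, `eps`) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pareto(alpha) with alpha = 1.5: E|X|^p is finite iff p < alpha,
# so the (1+eps)-th moment exists for eps < 0.5, but the variance
# (p = 2) does not -- the classic heavy-tailed reward regime.
alpha = 1.5
n = 200_000
rewards = rng.pareto(alpha, size=n) + 1.0  # support on [1, inf)

eps = 0.4  # 1 + eps < alpha, so this central moment is finite
moment = np.mean(np.abs(rewards - rewards.mean()) ** (1 + eps))
print(f"empirical (1+eps)-th central moment: {moment:.3f}")
```

Empirically, the $(1+\epsilon)$-th central moment above stabilizes as `n` grows, whereas the sample variance of the same draws keeps drifting upward, which is why bounded-variance analyses break down here.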