Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs
Neural Information Processing Systems
We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-\gamma, which is based on the \emph{optimism in the face of uncertainty} principle and a Bernstein-type bonus. We show that UCBVI-\gamma achieves an \tilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big) regret, where S is the number of states, A is the number of actions, \gamma is the discount factor, and T is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least \tilde{\Omega}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big). Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-\gamma is nearly minimax optimal for discounted MDPs.
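To illustrate the flavor of optimism with a Bernstein-type bonus in the tabular discounted setting, the sketch below runs optimistic value iteration on an empirical transition model, adding a variance-dependent bonus to each Bellman backup and truncating at the value range 1/(1-\gamma). This is a minimal illustration, not the paper's UCBVI-\gamma algorithm: the function name, the constants in the bonus, and the confidence parameter \delta are assumptions chosen for readability rather than the exact specification analyzed in the paper.

```python
import numpy as np

def optimistic_value_iteration(counts, rewards, gamma, delta=0.05, iters=200):
    """Sketch of optimism-based planning for a tabular discounted MDP.

    counts[s, a, s']  -- visit counts of observed transitions (the empirical model)
    rewards[s, a]     -- rewards in [0, 1], assumed known here for simplicity
    Returns an optimistic Q-function and the greedy policy.
    """
    S, A, _ = counts.shape
    n = np.maximum(counts.sum(axis=2), 1)          # visits to each (s, a)
    p_hat = counts / n[:, :, None]                 # empirical transition probabilities
    v_max = 1.0 / (1.0 - gamma)                    # value range for a discounted MDP
    log_term = np.log(S * A * n / delta)           # illustrative confidence term

    q = np.zeros((S, A))
    for _ in range(iters):
        v = q.max(axis=1)
        # Bernstein-style bonus: variance-dependent term plus a lower-order 1/n term.
        var = (p_hat * v[None, None, :] ** 2).sum(axis=2) - (p_hat @ v) ** 2
        bonus = np.sqrt(2 * var * log_term / n) + 7 * v_max * log_term / (3 * n)
        # Optimistic Bellman backup, truncated to the valid value range.
        q = np.minimum(rewards + gamma * (p_hat @ v) + bonus, v_max)
    policy = q.argmax(axis=1)
    return q, policy
```

The variance-dependent bonus is what distinguishes a Bernstein-type construction from a cruder Hoeffding-type bonus of order v_max/\sqrt{n}; exploiting the variance of the optimistic value function is what allows the analysis to reach the 1/(1-\gamma)^{1.5} dependence in the regret bound.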