Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Jun-3-2024, arXiv.org Machine Learning

In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms suffer from either sub-optimal regret guarantees or computational inefficiency. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$, where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S \times A$ is the size of the state-action space, and $T$ is the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can also be applied to various previous algorithms to improve their regret bounds.
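To make the quantities in the regret bound concrete, the following is a minimal sketch that computes the gain and the span of the optimal bias function for a small, fully known MDP using plain relative value iteration, then evaluates the leading term $\sqrt{\mathrm{sp}(h^*) S A T}$ with logarithmic factors ignored. It is not PMEVI and does not reproduce the paper's algorithm: the toy MDP, the function names, and the assumption that the transitions and rewards are known are all illustrative choices made here.

```python
import math


def span(h):
    """Span of a vector: sp(h) = max(h) - min(h)."""
    return max(h) - min(h)


def relative_value_iteration(P, r, iters=1000):
    """Estimate the optimal gain and a bias vector of a known MDP.

    P[s][a] is a list of next-state probabilities and r[s][a] a reward.
    State 0 is used as the reference state (an arbitrary choice), so the
    returned bias satisfies h[0] == 0.
    """
    n_states = len(P)
    h = [0.0] * n_states
    gain = 0.0
    for _ in range(iters):
        # Apply the Bellman optimality operator to the current bias.
        Th = [
            max(
                r[s][a] + sum(p * h[t] for t, p in enumerate(P[s][a]))
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
        gain = Th[0] - h[0]
        # Re-center on the reference state to keep the iterates bounded.
        h = [v - Th[0] for v in Th]
    return gain, h


def regret_leading_term(sp_h, n_states, n_actions, horizon):
    """sqrt(sp(h*) * S * A * T), with constants and log factors dropped."""
    return math.sqrt(sp_h * n_states * n_actions * horizon)


# Hypothetical 2-state, 2-action MDP: action 0 mostly stays, action 1
# mostly switches; only state 1 yields reward.
P = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.1, 0.9], [0.9, 0.1]],
]
r = [[0.0, 0.0], [1.0, 1.0]]

gain, h = relative_value_iteration(P, r)
sp_h = span(h)
bound = regret_leading_term(sp_h, n_states=2, n_actions=2, horizon=10_000)
```

In this toy instance the optimal policy moves to state 1 and stays there, so the span of the bias measures how much better it is to start in state 1 than in state 0. The learning setting of the paper differs in that neither the transitions nor $\mathrm{sp}(h^*)$ are known in advance.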


