Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Neural Information Processing Systems 

In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs).However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies.In this paper, we present the first *tractable* algorithm with minimax optimal regret of \mathrm{O}\left(\sqrt{\mathrm{sp}(h *) S A T \log(SAT)}\right) where \mathrm{sp}(h *) is the span of the optimal bias function h *, S\times A is the size of the state-action space and T the number of learning steps. Remarkably, our algorithm does not require prior information on \mathrm{sp}(h *) . Our algorithm relies on a novel subroutine, **P**rojected **M**itigated **E**xtended **V**alue **I**teration ( PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to obtain improved regret bounds.