Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

Neural Information Processing Systems 

We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function h {*}, the proposed algorithm achieves a regret bound of \tilde{O}(\sqrt{SATH}) \footnote{The symbol \tilde{O} means O with log factors ignored. Furthermore, this regret bound matches the lower bound of \Omega(\sqrt{SATH}) \cite{jaksch2010near} up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of \tilde{O}(\sqrt{DSAT}) for MDPs with finite diameter D compared to the lower bound of \Omega(\sqrt{DSAT}) \cite{jaksch2010near}.