Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function
Neural Information Processing Systems
We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle that efficiently solves Reinforcement Learning (RL) problems modeled as a Markov decision process (MDP) with a finite state-action space. By evaluating the state-pair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SATH})$\footnote{The symbol $\tilde{O}$ means $O$ with log factors ignored.}
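To make the central object concrete, the following minimal sketch computes the optimal bias function $h^{*}$ and its state-pair differences for a small MDP whose transitions and rewards are known. It is not the algorithm described above, which learns from interaction. It is plain relative value iteration on a hypothetical two-state, two-action MDP, and the names `relative_value_iteration`, `P`, and `r` are chosen here purely for illustration. The normalization $h(0) = 0$ is also an assumption, since a bias function is only determined up to an additive constant.

```python
import numpy as np


def relative_value_iteration(P, r, tol=1e-10, max_iter=100_000):
    """Relative value iteration for an average-reward MDP.

    P: array of shape (S, A, S), P[s, a, s'] = transition probability.
    r: array of shape (S, A), expected one-step reward.
    Returns (gain, h) with h normalized so that h[0] == 0.
    Assumes the iteration converges (e.g. an aperiodic, communicating MDP).
    """
    n_states = P.shape[0]
    h = np.zeros(n_states)
    for _ in range(max_iter):
        q = r + P @ h            # (S, A): reward plus expected next-state bias
        v = q.max(axis=1)        # greedy backup over actions
        gain = v[0]              # because h[0] == 0, v[0] estimates the gain
        h_new = v - v[0]         # re-anchor so the reference state stays at 0
        if np.max(np.abs(h_new - h)) < tol:
            return gain, h_new
        h = h_new
    raise RuntimeError("relative value iteration did not converge")


# Hypothetical toy MDP: state 1 pays reward 1, state 0 pays nothing.
P = np.array([
    [[0.5, 0.5], [0.9, 0.1]],    # from state 0: action 0 leaves faster
    [[0.5, 0.5], [0.1, 0.9]],    # from state 1: action 1 stays longer
])
r = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
])

gain, h = relative_value_iteration(P, r)
pair_diff = h[:, None] - h[None, :]   # pair_diff[s, s2] = h(s) - h(s2)
span = h.max() - h.min()              # largest state-pair difference

print("gain:", gain)
print("bias h:", h)
print("state-pair differences:\n", pair_diff)
print("span:", span)
```

For this toy MDP the optimality equation can be solved by hand, giving a gain of $5/6$ and $h^{*} = (0, 5/3)$, so the single nontrivial state-pair difference is $5/3$.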
Dec-25-2025, 19:13:59 GMT