dsat
Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function
We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SATH})$\footnote{The symbol $\tilde{O}$ means $O$ with log factors ignored.
Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function
We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function h {*}, the proposed algorithm achieves a regret bound of \tilde{O}(\sqrt{SATH}) \footnote{The symbol \tilde{O} means O with log factors ignored. Furthermore, this regret bound matches the lower bound of \Omega(\sqrt{SATH}) \cite{jaksch2010near} up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of \tilde{O}(\sqrt{DSAT}) for MDPs with finite diameter D compared to the lower bound of \Omega(\sqrt{DSAT}) \cite{jaksch2010near}.
Near-optimal Regret Bounds for Reinforcement Learning
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s1,s2 there is a policy which moves from s1 to s2 in at most D steps (on average). We present a reinforcement learning algorithm with total regret O(DSAT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Omega(DSAT) on the total regret of any learning algorithm. Both bounds demonstrate the utility of the diameter as structural parameter of the MDP.
Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function
We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function $h {*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SATH})$\footnote{The symbol $\tilde{O}$ means $O$ with log factors ignored. This result outperforms the best previous regret bounds $\tilde{O}(HS\sqrt{AT})$\cite{bartlett2009regal} by a factor of $\sqrt{SH}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SATH})$\cite{jaksch2010near} up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of $\tilde{O}(\sqrt{DSAT})$ for MDPs with finite diameter $D$ compared to the lower bound of $\Omega(\sqrt{DSAT})$\cite{jaksch2010near}.
Near-optimal Regret Bounds for Reinforcement Learning
Auer, Peter, Jaksch, Thomas, Ortner, Ronald
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s1,s2 there is a policy which moves from s1 to s2 in at most D steps (on average). We present a reinforcement learning algorithm with total regret O(DSAT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Omega(DSAT) on the total regret of any learning algorithm. Both bounds demonstrate the utility of the diameter as structural parameter of the MDP.