Achieving $\tilde{O}(1/\epsilon)$ Sample Complexity for Constrained Markov Decision Process

Neural Information Processing Systems 

We consider the reinforcement learning problem for the constrained Markov decision process (CMDP), which plays a central role in satisfying safety or resource constraints in sequential learning and decision-making. In this problem, we are given finite resources and an MDP with unknown transition probabilities. At each stage, we take an action, collecting a reward and consuming some resources, all of which are unknown and must be learned over time. In this work, we take the first step towards deriving optimal problem-dependent guarantees for the CMDP problem. We derive a logarithmic regret bound, which translates into an $O\left(\frac{1}{\Delta\cdot\epsilon}\cdot\log^2(1/\epsilon)\right)$ sample complexity bound, with $\Delta$ being a problem-dependent parameter that is independent of $\epsilon$.
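
To see how a logarithmic regret bound translates into this sample complexity, here is a minimal sketch; the constant $C$ and the exact form of the regret bound below are illustrative assumptions, not statements of the paper's actual bound. Suppose the cumulative regret after $T$ episodes satisfies $\mathrm{Regret}(T) \le \frac{C}{\Delta}\log^2 T$ for some problem-dependent constant $C$. Requiring the average regret to fall below $\epsilon$ and solving for $T$ gives
\[
  \frac{\mathrm{Regret}(T)}{T} \;\le\; \frac{C}{\Delta}\cdot\frac{\log^2 T}{T} \;\le\; \epsilon
  \quad\Longleftarrow\quad
  T \;=\; O\!\left(\frac{1}{\Delta\cdot\epsilon}\,\log^2(1/\epsilon)\right)
  \;=\; \tilde{O}\!\left(\frac{1}{\Delta\cdot\epsilon}\right),
\]
since $\log^2 T = O(\log^2(1/\epsilon))$ up to $\Delta$-dependent factors when $\Delta$ is independent of $\epsilon$.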