Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model

Neural Information Processing Systems 

In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP) provided we can only access its transition function through a generative sampling model that given any state-action pair samples from the transition function in $O(1)$ time.