Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation

Neural Information Processing Systems 

We consider the question of learning the Q-function in a sample-efficient manner for reinforcement learning with continuous state and action spaces under a generative model. If the Q-function is Lipschitz continuous, then the minimal sample complexity for estimating an \epsilon-optimal Q-function is known to scale as \Omega(\frac{1}{\epsilon^{d_1+d_2+2}}) per classical non-parametric learning theory, where d_1 and d_2 denote the dimensions of the state and action spaces respectively. The Q-function, when viewed as a kernel, induces a Hilbert-Schmidt operator and hence possesses a square-summable spectrum. This motivates us to consider a parametric class of Q-functions parameterized by its "rank" r, which contains all Lipschitz Q-functions as r\to\infty. As our key contribution, we develop a simple, iterative learning algorithm that finds an \epsilon-optimal Q-function with sample complexity of \widetilde{O}(\frac{1}{\epsilon^{\max(d_1,d_2)+2}}) when the optimal Q-function has low rank r and the discount factor \gamma is below a certain threshold.
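
To make the "rank" parameterization concrete, here is one natural reading of the abstract's spectral argument (the paper's formal definition may differ in details): the Hilbert-Schmidt property yields a singular value decomposition of the Q-function with square-summable spectrum, and the rank-r class keeps only the first r terms,

\[
Q(s,a) = \sum_{i=1}^{\infty} \sigma_i\, f_i(s)\, g_i(a), \qquad \sum_{i=1}^{\infty} \sigma_i^2 < \infty,
\qquad\text{rank-}r\text{ class: } Q(s,a) = \sum_{i=1}^{r} \sigma_i\, f_i(s)\, g_i(a).
\]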
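
As an illustration of the low-rank mechanism, below is a minimal Python sketch of value iteration with a rank-r projection on a finite state-action grid under a generative model. It is not the paper's algorithm: the actual method obtains its sample savings by querying only a subset of anchor rows and columns and completing the matrix, whereas this sketch backs up every grid entry and merely truncates each iterate via SVD. The helper sample_reward_next_state, the grid, and all parameter values are assumptions for illustration.

import numpy as np

def low_rank_q_iteration(sample_reward_next_state, n_states, n_actions,
                         rank, gamma, n_samples=10, n_iters=50):
    """Approximate value iteration with a rank-`rank` SVD projection.

    sample_reward_next_state(i, j) -> (reward, next_state_index) is an
    assumed generative-model query at grid cell (state i, action j);
    continuous next states are taken to be snapped to the grid.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)  # greedy value of the current iterate
        target = np.empty((n_states, n_actions))
        for i in range(n_states):
            for j in range(n_actions):
                # Monte Carlo Bellman backup at grid cell (i, j).
                backups = [r + gamma * V[k]
                           for r, k in (sample_reward_next_state(i, j)
                                        for _ in range(n_samples))]
                target[i, j] = np.mean(backups)
        # Project the backed-up matrix onto rank-`rank` matrices
        # by keeping the top singular directions.
        U, S, Vt = np.linalg.svd(target, full_matrices=False)
        Q = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    return Q

With gamma below 1 and enough samples per backup, each iteration is a standard approximate Bellman update; the rank-r truncation is what ties the iterate to the low-rank structure posited above.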