Uncoupled and Convergent Learning in Two-Player Zero-Sum Markov Games with Bandit Feedback

Neural Information Processing Systems 

We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is *uncoupled*, *convergent*, and *rational*, with non-asymptotic convergence rates to Nash equilibrium. As a warm-up, we start from the case of stateless matrix games with bandit feedback, showing an $\tilde{\mathcal{O}}(t^{-\frac{1}{8}})$ last-iterate convergence rate. To the best of our knowledge, this is the first result that obtains a finite last-iterate convergence rate given access to only bandit feedback. We extend our result to the case of irreducible Markov games, providing a last-iterate convergence rate of $\tilde{\mathcal{O}}(t^{-\frac{1}{9+\varepsilon}})$ for any $\varepsilon > 0$. Finally, we study Markov games without any assumptions on the dynamics, and show a *path convergence* rate, a new notion of convergence we define, of $\tilde{\mathcal{O}}(t^{-\frac{1}{10}})$.
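The *uncoupled, bandit-feedback* protocol of the matrix-game warm-up can be illustrated with a minimal self-play sketch. The code below is *not* the paper's algorithm; it is a standard EXP3 learner, which only guarantees *time-average* (not last-iterate) convergence — precisely the gap the abstract addresses. Each player updates independently from its own realized payoff in rock-paper-scissors, never observing the payoff matrix or the opponent's strategy, and the averaged strategies approach the uniform Nash equilibrium $(1/3, 1/3, 1/3)$.

```python
import math
import random

# Rock-paper-scissors payoffs to the row (max) player; column player minimizes.
A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def probs(w, gamma):
    """Mix normalized weights with uniform exploration gamma."""
    s = sum(w)
    K = len(w)
    return [(1 - gamma) * wi / s + gamma / K for wi in w]

def sample(p):
    """Draw an index from the probability vector p."""
    r, c = random.random(), 0.0
    for i, pi in enumerate(p):
        c += pi
        if r < c:
            return i
    return len(p) - 1

def run(T=50000, seed=0):
    """Uncoupled self-play: both players run EXP3 on bandit feedback only."""
    random.seed(seed)
    K = 3
    eta = math.sqrt(2 * math.log(K) / (T * K))  # standard EXP3 step size
    gamma = eta * K / 2                          # small forced exploration
    w1, w2 = [1.0] * K, [1.0] * K
    avg1 = [0.0] * K                             # time-averaged row strategy
    for _ in range(T):
        p1, p2 = probs(w1, gamma), probs(w2, gamma)
        i, j = sample(p1), sample(p2)
        # Bandit feedback: each player sees only its own realized payoff,
        # rescaled to a loss in [0, 1].
        loss1 = (1 - A[i][j]) / 2
        loss2 = (1 + A[i][j]) / 2
        # Importance-weighted loss estimate, exponential weight update.
        w1[i] *= math.exp(-eta * loss1 / p1[i])
        w2[j] *= math.exp(-eta * loss2 / p2[j])
        for k in range(K):
            avg1[k] += p1[k] / T
    return avg1

if __name__ == "__main__":
    print(run())  # each coordinate is close to 1/3
```

Because the last iterate of EXP3-style dynamics can cycle around the equilibrium while the average converges, only `avg1` is reported here; obtaining a rate for the *last* iterate under bandit feedback is the harder problem the paper targets.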