Uncoupled and Convergent Learning in Two-Player Zero-Sum Markov Games with Bandit Feedback

Neural Information Processing Systems 

Our paper takes the first step toward removing such assumptions.