Reviews: Online Reinforcement Learning in Stochastic Games

Neural Information Processing Systems 

The paper considers the problem of online learning in two-player zero-sum stochastic games. The main result is constructing a strategy for player 1 that guarantees that the cumulative rewards will never go below the maximin value of the game by more than a certain bound, no matter what strategy the other player follows. The bound is shown to grow sublinearly in the number of rounds T of the game, and polynomially on other problem parameters such as the diameter, the size of the state and action spaces. The results imply that the proposed algorithm can be used in self-play to compute near-maximin strategies for both players. The algorithm and the analysis are largely based on the UCRL algorithm of Auer and Ortner (2007) and the analysis thereof.