Self-play reinforcement learning has proved to be successful in many perfect information two-player games. However, research carrying over its theoretical guarantees and practical success to games of imperfect information has been lacking. In this paper, we evaluate self-play Monte-Carlo Tree Search in limit Texas Hold'em and Kuhn poker. We introduce a variant of the established UCB algorithm and provide first empirical results demonstrating its ability to find approximate Nash equilibria.
Many existing Reinforcement Learning (RL) systems already rely on simulations to explore the solution space and solve complex problems. These include systems based on Self-Play for gaming applications. Self-Play is an essential part of the algorithms used by Google DeepMind in AlphaGo and in the more recent AlphaGo Zero reinforcement learning systems. These are the breakthrough approaches that have defeated the world champion at the ancient Chinese game of Go (D. Silver et al., 2017 https://www.nature.com/articles/nature24270 The newer AlphaGo Zero system has achieved a significant step forward compared to the original Alpha Go system.
The AlphaZero algorithm for the learning of strategy games via self-play, which has produced superhuman ability in the games of Go, chess, and shogi, uses a quantitative reward function for game outcomes, requiring the users of the algorithm to explicitly balance different components of the reward against each other, such as the game winner and margin of victory. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need to perform any quantitative balancing of reward components. We demonstrate that this system learns optimal play in a comparable amount of time to AlphaZero on a sample game.