Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games
Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable PSRO-based method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of 10^50. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots.
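The pipeline idea described in the abstract can be illustrated in a toy normal-form setting. The sketch below is illustrative only and makes several simplifying assumptions: the reinforcement learning training step is replaced by an exact best-response oracle on rock-paper-scissors, the meta-Nash over the fixed policies is approximated with fictitious play, and the pipeline's levels are run one at a time rather than in parallel. All function names here are hypothetical, not from the paper.

```python
import numpy as np

# Row player's payoff for rock-paper-scissors (toy stand-in for a large game).
PAYOFF = np.array([[ 0., -1.,  1.],
                   [ 1.,  0., -1.],
                   [-1.,  1.,  0.]])

def meta_nash(policies, iters=2000):
    """Approximate Nash of the restricted game over the fixed policies,
    via fictitious play (a stand-in for solving the empirical meta-game)."""
    sub = PAYOFF[np.ix_(policies, policies)]
    counts = np.zeros(len(policies))
    counts[0] = 1.0
    for _ in range(iters):
        mix = counts / counts.sum()
        counts[np.argmax(sub @ mix)] += 1.0
    return counts / counts.sum()

def action_mixture(policies, meta):
    """Map a distribution over fixed policies to a distribution over actions."""
    mix = np.zeros(PAYOFF.shape[0])
    for p, w in zip(policies, meta):
        mix[p] += w
    return mix

def best_response(mix):
    """Exact best response -- stand-in for an RL worker training
    against the mixture of lower-level policies."""
    return int(np.argmax(PAYOFF @ mix))

# Sequential version of the pipeline: the lowest active worker trains against
# the meta-strategy over the fixed (lower-level) policies; once done, its
# policy is frozen and a fresh worker starts at the top of the hierarchy.
fixed = [0]                              # begin with an arbitrary pure policy
for _ in range(2):
    mix = action_mixture(fixed, meta_nash(fixed))
    fixed.append(best_response(mix))

final_mix = action_mixture(fixed, meta_nash(fixed))
exploitability = float(np.max(PAYOFF @ final_mix))
print(sorted(set(fixed)), round(exploitability, 3))
```

Starting from rock alone, the restricted game grows to {rock, paper} and then to all three pure strategies, at which point the meta-strategy is close to the uniform Nash equilibrium and the exploitability is small. In the actual P2SRO algorithm, several such workers train concurrently at different pipeline levels, which is where the parallel speedup comes from.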
Review for NeurIPS paper: Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games
Weaknesses: The paper is missing a comparison with the most relevant previous work, namely XFP [1] (Heinrich and Silver) and Deep CFR. Both of these works are mentioned in the Background and Related Work section, but: 1) XFP is only mentioned and is never compared against in the experiments; 2) Deep CFR is dismissed with "However, Deep CFR uses external sampling, which may be impractical for games with a large branching factor such as Stratego and Barrage Stratego." Furthermore, there are newer variants based on this work, and it is not limited to a particular form of sampling; the paper only really compares against other variants from the PSRO family. Finally, the theory and algorithms (as described in the text) deal only with matrix games, while the experiments are on extensive-form games. If the goal is to run on top of the exponentially large induced matrix game, this should be discussed.