The Advantage Regret-Matching Actor-Critic
Gruslys, Audrūnas, Lanctot, Marc, Munos, Rémi, Timbers, Finbarr, Schmid, Martin, Perolat, Julien, Morrill, Dustin, Zambaldi, Vinicius, Lespiau, Jean-Baptiste, Schultz, John, Azar, Mohammad Gheshlaghi, Bowling, Michael, Tuyls, Karl
–arXiv.org Artificial Intelligence
Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior. We propose a model-free RL algorithm, the Advantage Regret-Matching Actor-Critic (ARMAC): rather than saving past state-action data, ARMAC saves a buffer of past policies, replaying through them to reconstruct hindsight assessments of past behavior. These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produce a new policy. In particular, ARMAC learns from sampled trajectories in a centralized training setting, without requiring the application of importance sampling commonly used in Monte Carlo counterfactual regret (CFR) minimization; hence, it does not suffer from excessive variance in large environments. In the single-agent setting, ARMAC shows an interesting form of exploration by keeping past policies intact. In the multiagent setting, ARMAC in self-play approaches Nash equilibria on some partially-observable zero-sum benchmarks. We provide exploitability estimates in the significantly larger game of betting-abstracted no-limit Texas Hold'em.
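As a rough illustration of the regret-matching step mentioned in the abstract, the sketch below maps per-action advantage estimates at a state to a policy. This is a minimal, generic regret-matching rule, not the authors' implementation; the function name and the array-based interface are assumptions for the example.

```python
import numpy as np

def regret_matching_policy(advantages: np.ndarray) -> np.ndarray:
    """Turn per-action advantage (regret) estimates into a policy.

    Standard regret matching: each action gets probability proportional
    to the positive part of its estimated advantage; if no action has a
    positive advantage, fall back to the uniform policy.
    """
    positive = np.maximum(advantages, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full(len(advantages), 1.0 / len(advantages))

# Example: hypothetical predicted conditional advantages for three actions.
policy = regret_matching_policy(np.array([0.5, -0.2, 1.0]))
print(policy)  # -> [0.333..., 0.0, 0.666...]
```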
Aug-27-2020