Trust Region Bounds for Decentralized PPO Under Non-stationarity
Sun, Mingfei, Devlin, Sam, Beck, Jacob, Hofmann, Katja, Whiteson, Shimon
arXiv.org Artificial Intelligence
We present trust region bounds for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL) that hold even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, IPPO and MAPPO, both of which rely on independent ratios, i.e., probability ratios computed separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises from enforcing the trust region constraint over all decentralized policies. We also show that this trust region constraint can be enforced in a principled way by bounding independent ratios based on the number of agents in training, which provides a theoretical foundation for proximal ratio clipping. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and of tuning the hyperparameters with respect to the number of agents, as predicted by our theoretical analysis.
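To make the idea of independent ratios concrete, the following is a minimal NumPy sketch of a PPO-style clipped surrogate in which each agent's probability ratio is computed and clipped separately. It is an illustration under stated assumptions, not the authors' implementation: the function names, the array shapes, and the use of a single shared team advantage are all hypothetical choices made here. The second helper only shows the elementary fact that if every per-agent ratio lies in [1 - epsilon, 1 + epsilon], the product of N such ratios lies in [(1 - epsilon)^N, (1 + epsilon)^N], which gives some intuition for why the clipping range might need to depend on the number of agents. It is not the bound derived in the paper.

```python
import numpy as np


def independent_ratio_clipped_objective(logp_new, logp_old, advantages, epsilon):
    """Clipped surrogate with one probability ratio per agent (illustrative sketch).

    logp_new, logp_old: arrays of shape (batch, n_agents) holding each agent's
        log-probability of its own taken action under the new and old policies.
    advantages: array of shape (batch,), assumed here to be a shared team advantage.
    epsilon: clipping range applied to every agent's ratio independently.
    """
    ratios = np.exp(logp_new - logp_old)            # shape (batch, n_agents)
    adv = advantages[:, None]                       # broadcast over agents
    unclipped = ratios * adv
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * adv
    # Standard PPO pessimistic choice, applied per agent, then averaged.
    return np.minimum(unclipped, clipped).mean()


def joint_ratio_bounds(epsilon, n_agents):
    """Range of the product of n_agents ratios, each clipped to [1-eps, 1+eps]."""
    return (1.0 - epsilon) ** n_agents, (1.0 + epsilon) ** n_agents


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logp_old = rng.normal(size=(4, 3))
    logp_new = logp_old + 0.1 * rng.normal(size=(4, 3))
    adv = rng.normal(size=4)
    print(independent_ratio_clipped_objective(logp_new, logp_old, adv, 0.2))
    for n in (2, 5, 10):
        print(n, joint_ratio_bounds(0.2, n))
```

With a fixed epsilon, the interval returned by `joint_ratio_bounds` widens quickly as `n_agents` grows, so keeping the joint deviation within a fixed range requires a smaller per-agent epsilon when there are more agents. How exactly to scale it is what the paper's analysis addresses; the sketch makes no claim about that.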
Feb-15-2023
- Country:
- Asia > Middle East
- Jordan (0.04)
- Europe
- France > Hauts-de-France
- Sweden > Stockholm
- Stockholm (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Greater London > London (0.04)
- Greater Manchester > Manchester (0.04)
- Oxfordshire > Oxford (0.14)
- North America
- Canada > Quebec
- Montreal (0.04)
- United States
- California > Los Angeles County
- Long Beach (0.04)
- Colorado > Denver County
- Denver (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Oceania > Australia
- New South Wales > Sydney (0.04)
- Genre:
- Research Report > New Finding (0.46)
- Technology: