Reviews: Actor-Critic Policy Optimization in Partially Observable Multiagent Environments
–Neural Information Processing Systems
Specifically, it shows the connection by defining a new variant of an actor-critic algorithm that performs an exhaustive policy evaluation at each stage (denoted policy-iteration-actor-critic), together with an adaptive learning rate. Under this setting, the paper claims that the actor-critic algorithm minimizes regret and converges to a Nash equilibrium. The paper proposes several new policy gradient update rules (Q-based Policy Gradient, Regret Policy Gradient, and Regret Matching Policy Gradient) and evaluates them on multi-agent zero-sum imperfect-information games. To my understanding, Q-based Policy Gradient is essentially an advantage actor-critic algorithm (up to a transformation of the learned baseline).
3. The authors mention a "reasonable parameter sweep" over the hyperparameters. I'm curious how stable the proposed actor-critic algorithms are across the different trials.
4. The paper should be proofread again.
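To make the advantage actor-critic remark above concrete, here is a minimal sketch for a single state with a tabular policy. It is an illustration under my own assumptions, not the paper's implementation: the function names `advantages` and `regret_matching_policy` and the numeric values are hypothetical. It shows the advantage formed by subtracting the policy-weighted baseline from the action values, and the standard regret-matching rule that plays actions in proportion to positive regret.

```python
import numpy as np


def advantages(q, pi):
    """Advantage of each action: q(s, a) minus the baseline sum_b pi(b) q(s, b)."""
    q = np.asarray(q, dtype=float)
    pi = np.asarray(pi, dtype=float)
    baseline = float(np.dot(pi, q))  # state value under the current policy
    return q - baseline


def regret_matching_policy(regret):
    """Standard regret matching: probabilities proportional to positive regret.

    Falls back to the uniform policy when no action has positive regret.
    """
    positive = np.maximum(np.asarray(regret, dtype=float), 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(positive.shape, 1.0 / positive.size)


# Hypothetical action values and policy for one state (illustrative only).
q = [1.0, 0.0, -1.0]
pi = [0.5, 0.25, 0.25]

adv = advantages(q, pi)
new_pi = regret_matching_policy(adv)
```

In this sketch the advantages average to zero under the policy by construction, which is the sense in which the baseline only shifts the action values. Treating the per-state advantage as an instantaneous regret is my reading of the connection the paper draws, so the exact update rules should be checked against the paper itself.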
Oct-8-2024, 07:37:52 GMT