Goto

Collaborating Authors

 matrix game


Dual Self-Awareness Value Decomposition Framework without Individual Global Max for Cooperative MARL

Neural Information Processing Systems

Value decomposition methods have gained popularity in the field of cooperative multi-agent reinforcement learning. However, almost all existing methods follow the principle of Individual Global Max (IGM) or its variants, which limits their problem-solving capabilities. To address this, we propose a dual self-awareness value decomposition framework, inspired by the notion of dual self-awareness in psychology, that entirely rejects the IGM premise. Each agent consists of an ego policy for action selection and an alter ego value function to solve the credit assignment problem. The value function factorization can ignore the IGM assumption by utilizing an explicit search procedure. On the basis of the above, we also suggest a novel anti-ego exploration mechanism to avoid the algorithm becoming stuck in a local optimum. As the first fully IGM-free value decomposition method, our proposed framework achieves desirable performance in various cooperative tasks.



Conic Blackwell Algorithm: Parameter-Free Convex-Concave Saddle-Point Solving

Neural Information Processing Systems

We develop new parameter-free and scale-free algorithms for solving convexconcave saddle-point problems. Our results are based on a new simple regret minimizer, the Conic Blackwell Algorithm+ (CBA+), which attains O(1/ T) average regret. Intuitively, our approach generalizes to other decision sets of interest ideas from the Counterfactual Regret minimization (CFR+) algorithm, which has very strong practical performance for solving sequential games on simplexes. We show how to implement CBA+ for the simplex, `p norm balls, and ellipsoidal confidence regions in the simplex, and we present numerical experiments for solving matrix games and distributionally robust optimization problems. Our empirical results show that CBA+ is a simple algorithm that outperforms state-ofthe-art methods on synthetic data and real data instances, without the need for any choice of step sizes or other algorithmic parameters.


Appendix for " Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games " Table of Contents

Neural Information Processing Systems

A.1 Proof of Theorem 1 To prove Theorem 1, we need the help of the following Lemma See Proposition 7.1 in [3]. Now we can prove our Theorem 1. Proof. For games with only one step (normal-form games, functional-form games), there is only one fixed state. Therefore, the distribution of state-action is equivalent to the distribution of the action. A.2 Proof of Theorem 2 Let us restate our Theorem 2 Theorem 2. For a given empirical payoff matrix A RM N and the reward vector aM+1 for policy M + ||(I A>(A>))aM+1||2, (18) where (A>) is the Moore-Penrose pseudoinverse of A>, and ฯƒmin(A) is the minimum singular value of A. Proof. The last equation comes from the analytic calculation of min1>ฮฒ=1 ||ฮฒ (A>) aM+1||2 using Lagrangian.


Inequity aversion improves cooperation in intertemporal social dilemmas

Neural Information Processing Systems

Groups of humans are often able to find ways to cooperate with one another in complex, temporally extended social dilemmas. Models based on behavioral economics are only able to explain this phenomenon for unrealistic stateless matrix games. Recently, multi-agent reinforcement learning has been applied to generalize social dilemma problems to temporally and spatially extended Markov games. However, this has not yet generated an agent that learns to cooperate in social dilemmas as humans do. A key insight is that many, but not all, human individuals have inequity averse social preferences. This promotes a particular resolution of the matrix game social dilemma wherein inequity-averse individuals are personally pro-social and punish defectors. Here we extend this idea to Markov games and show that it promotes cooperation in several types of sequential social dilemma, via a profitable interaction with policy learnability. In particular, we find that inequity aversion improves temporal credit assignment for the important class of intertemporal social dilemmas. These results help explain how large-scale cooperation may emerge and persist.






Regret Matching +: (In)Stability and Fast Convergence in Games

Neural Information Processing Systems

However, a theoretical understanding of their success in practice is still a mystery. Moreover, recent advances [34] on fast convergence in games are limited to no-regret algorithms such as online mirror descent, which satisfy stability.