 Agent Societies


Can Reinforcement Learning Find Stackelberg-Nash Equilibria in General-Sum Markov Games with Myopic Followers?

arXiv.org Machine Learning

Reinforcement learning (RL) has achieved striking empirical successes in solving real-world sequential decision-making problems (Mnih et al., 2015; Duan et al., 2016; Silver et al., 2016, 2017, 2018; Agostinelli et al., 2019; Akkaya et al., 2019). Motivated by these successes, multi-agent extensions of RL algorithms have recently become popular in decision-making problems involving multiple interacting agents (Busoniu et al., 2008; Hernandez-Leal et al., 2018, 2019; OroojlooyJadid and Hajinezhad, 2019; Zhang et al., 2019). Multi-agent RL is often modeled as a Markov game (Littman, 1994) where, at each time step, given the state of the environment, the players (agents) take their actions simultaneously, each observes her own immediate reward, and the environment evolves into a next state. Here both the reward of each player and the state transition depend on the actions of all players. Each player's goal is to find a policy that maximizes her expected total reward in the presence of other agents. In Markov games, depending on the structure of the reward functions, the relationship among the players can be collaborative, where all players share the same reward function; competitive, where the reward functions sum to zero; or mixed, which corresponds to a general-sum game. While most of the existing theoretical results focus on the collaborative or two-player competitive settings, the mixed setting is often more pertinent to real-world multi-agent applications. Moreover, in addition to having diverse reward functions, the players might also have asymmetric roles in the Markov game: the players might be divided into leaders and followers, where the leaders' joint policy determines a general-sum game for the followers.
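The three reward structures named in this abstract can be told apart mechanically. The following is a minimal sketch, not taken from the paper: the function name `classify_rewards` and the flat reward-table layout (one list per player, indexed by an enumeration of state and joint-action pairs) are assumptions made purely for illustration.

```python
def classify_rewards(reward_tables, tol=1e-9):
    """Classify a game by its reward structure.

    reward_tables: one list of rewards per player (hypothetical layout).
    Position k in every list refers to the same (state, joint action) pair.
    """
    # Regroup so each entry holds all players' rewards for one pair.
    entries = list(zip(*reward_tables))

    # Collaborative: every player receives the same reward everywhere.
    if all(max(e) - min(e) <= tol for e in entries):
        return "collaborative"

    # Competitive: rewards sum to zero everywhere.
    if all(abs(sum(e)) <= tol for e in entries):
        return "competitive"

    # Anything else is a general-sum (mixed) game.
    return "mixed"


print(classify_rewards([[1, 0], [1, 0]]))    # identical rewards
print(classify_rewards([[1, -2], [-1, 2]]))  # rewards cancel out
print(classify_rewards([[1, 0], [0, 1]]))    # neither
```

A table of all-zero rewards satisfies both tests; this sketch reports it as collaborative because that check runs first.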


Learning Cooperative Multi-Agent Policies with Partial Reward Decoupling

arXiv.org Artificial Intelligence

One of the preeminent obstacles to scaling multi-agent reinforcement learning to large numbers of agents is assigning credit to individual agents' actions. In this paper, we address this credit assignment problem with an approach that we call partial reward decoupling (PRD), which attempts to decompose large cooperative multi-agent RL problems into decoupled subproblems involving subsets of agents, thereby simplifying credit assignment. We empirically demonstrate that decomposing the RL problem using PRD in an actor-critic algorithm results in lower variance policy gradient estimates, which improves data efficiency, learning stability, and asymptotic performance across a wide array of multi-agent RL tasks, compared to various other actor-critic approaches. Additionally, we relate our approach to counterfactual multi-agent policy gradient (COMA), a state-of-the-art MARL algorithm, and empirically show that our approach outperforms COMA by making better use of information in agents' reward streams, and by enabling recent advances in advantage estimation to be used.


Local Advantage Networks for Cooperative Multi-Agent Reinforcement Learning

arXiv.org Artificial Intelligence

Multi-agent reinforcement learning (MARL) enables us to create adaptive agents in challenging environments, even when the agents have limited observation. Modern MARL methods have hitherto focused on finding factorized value functions. While this approach has proven successful, the resulting methods have convoluted network structures. We take a radically different approach, and build on the structure of independent Q-learners. Inspired by influence-based abstraction, we start from the observation that compact representations of the observation-action histories can be sufficient to learn close to optimal decentralized policies. Combining this observation with a dueling architecture, our algorithm, LAN, represents these policies as separate individual advantage functions w.r.t. a centralized critic. These local advantage networks condition only on a single agent's local observation-action history. The centralized value function conditions on the agents' representations as well as the full state of the environment. The value function, which is cast aside before execution, serves as a stabilizer that coordinates the learning and is used to formulate DQN targets during learning. In contrast with other methods, this enables LAN to keep the number of network parameters of its centralized network independent of the number of agents, without imposing additional constraints like monotonic value functions. When evaluated on the StarCraft multi-agent challenge benchmark, LAN shows state-of-the-art performance and scores more than 80% wins in two previously unsolved maps, 'corridor' and '3s5z_vs_3s6z', leading to an improvement of 10% over QPLEX on average performance on the 14 maps. Moreover, when the number of agents becomes large, LAN uses significantly fewer parameters than QPLEX or even QMIX. We thus show that LAN's structure forms a key improvement that helps MARL methods remain scalable.


Variational Automatic Curriculum Learning for Sparse-Reward Cooperative Multi-Agent Problems

arXiv.org Artificial Intelligence

We introduce a curriculum learning algorithm, Variational Automatic Curriculum Learning (VACL), for solving challenging goal-conditioned cooperative multi-agent reinforcement learning problems. We motivate our paradigm through a variational perspective, where the learning objective can be decomposed into two terms: task learning on the current task distribution, and curriculum update to a new task distribution. Local optimization over the second term suggests that the curriculum should gradually expand the training tasks from easy to hard. Our VACL algorithm implements this variational paradigm with two practical components, task expansion and entity progression, which produces training curricula over both the task configurations as well as the number of entities in the task. Experiment results show that VACL solves a collection of sparse-reward problems with a large number of agents. Particularly, using a single desktop machine, VACL achieves 98% coverage rate with 100 agents in the simple-spread benchmark and reproduces the ramp-use behavior originally shown in OpenAI's hide-and-seek project. Our project website is at https://sites.google.com/view/vacl-neurips-2021.


Offline Pre-trained Multi-Agent Decision Transformer: One Big Sequence Model Tackles All SMAC Tasks

arXiv.org Artificial Intelligence

Offline reinforcement learning leverages previously collected offline datasets to learn optimal policies without needing access to the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the increased interactions among agents and with the environment. Yet, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate the research by providing large-scale datasets, and use them to examine the usage of the Decision Transformer in the context of MARL. We investigate the generalisation of MARL offline pre-training in the following three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) to multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose the novel architecture of multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer's sequence modelling ability and integrates it seamlessly with both offline and online MARL tasks. A crucial benefit of MADT is that it learns generalisable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms the state-of-the-art offline RL baselines. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency, and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalisability enhancements in MARL.


Meta-CPR: Generalize to Unseen Large Number of Agents with Communication Pattern Recognition Module

arXiv.org Artificial Intelligence

Designing an effective communication mechanism among agents in reinforcement learning has been a challenging task, especially for real-world applications. In real-world scenarios, the number of agents can grow, or an environment may need to interact with a changing number of agents. To this end, a multi-agent framework needs to handle various scenarios of agents, in terms of both scale and dynamics, to be practical for real-world applications. We formulate the multi-agent environment with a different number of agents as a multi-tasking problem and propose a meta reinforcement learning (meta-RL) framework to tackle this problem. The proposed framework employs a meta-learned Communication Pattern Recognition (CPR) module to identify communication behavior and extract information that facilitates the training process. Experimental results are poised to demonstrate that the proposed framework (a) generalizes to an unseen larger number of agents and (b) allows the number of agents to change between episodes. An ablation study is also provided to justify the proposed CPR design and show that it is effective.


Value Function Factorisation with Hypergraph Convolution for Cooperative Multi-agent Reinforcement Learning

arXiv.org Artificial Intelligence

Cooperation between agents in a multi-agent system (MAS) has become a hot topic in recent years, and many algorithms based on centralized training with decentralized execution (CTDE), such as VDN and QMIX, have been proposed. However, these methods disregard the information hidden in the individual action values. In this paper, we propose HyperGraph CoNvolution MIX (HGCN-MIX), a method that combines hypergraph convolution with value decomposition. By treating action values as signals, HGCN-MIX aims to explore the relationship between these signals via a self-learning hypergraph. Experimental results show that HGCN-MIX matches or surpasses state-of-the-art techniques on the StarCraft II multi-agent challenge (SMAC) benchmark in various situations, notably those with a number of agents.


Self-Organized Polynomial-Time Coordination Graphs

arXiv.org Artificial Intelligence

The coordination graph is a promising approach to modeling agent collaboration in multi-agent reinforcement learning. It factorizes a large multi-agent system into a suite of overlapping groups that represent the underlying coordination dependencies. One critical challenge in this paradigm is the complexity of computing maximum-value actions for a graph-based value factorization. This task is a decentralized constraint optimization problem (DCOP); both the problem itself and its constant-ratio approximation are NP-hard. To bypass this fundamental hardness, this paper proposes a novel method, named Self-Organized Polynomial-time Coordination Graphs (SOP-CG), which uses structured graph classes to guarantee the optimality of the induced DCOPs with sufficient function expressiveness. We extend the graph topology to be state-dependent, formulate the graph selection as an imaginary agent, and finally derive an end-to-end learning paradigm from the unified Bellman optimality equation. In experiments, we show that our approach learns interpretable graph topologies, induces effective coordination, and improves performance across a variety of cooperative multi-agent tasks.
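To make the maximum-value action problem concrete, here is a minimal sketch that finds the best joint action for a pairwise value factorization by exhaustive enumeration. This is not the SOP-CG method; it is a brute-force baseline whose cost grows exponentially with the number of agents, which is the hardness the abstract refers to. The function name `best_joint_action`, the payoff tables, and the three-agent chain graph are all hypothetical.

```python
from itertools import product


def best_joint_action(n_agents, n_actions, edge_payoffs):
    """Exhaustively maximise a sum of pairwise payoffs.

    edge_payoffs maps an edge (i, j) to a table q with q[a_i][a_j]
    giving that edge's payoff (hypothetical layout for illustration).
    """
    best, best_value = None, float("-inf")
    # Enumerate all n_actions ** n_agents joint actions.
    for joint in product(range(n_actions), repeat=n_agents):
        value = sum(
            q[joint[i]][joint[j]] for (i, j), q in edge_payoffs.items()
        )
        if value > best_value:
            best, best_value = joint, value
    return best, best_value


# Chain graph 0 - 1 - 2 with two actions per agent.
payoffs = {
    (0, 1): [[1, 0], [0, 2]],
    (1, 2): [[0, 3], [1, 0]],
}
print(best_joint_action(3, 2, payoffs))  # ((0, 0, 1), 4)
```

Agent 1 appears in both edges, so its action cannot be chosen per edge in isolation: the second action is best for edge (0, 1) alone, yet the joint optimum uses the first.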


Adversarial Attacks in Cooperative AI

arXiv.org Artificial Intelligence

Single-agent reinforcement learning algorithms in a multi-agent environment are inadequate for fostering cooperation. If intelligent agents are to interact and work together to solve complex problems, methods that counter non-cooperative behavior are needed to facilitate the training of multiple agents. This is the goal of cooperative AI. Recent work in adversarial machine learning, however, shows that models (e.g., image classifiers) can be easily deceived into making incorrect decisions. In addition, some past research in cooperative AI has relied on new notions of representations, like public beliefs, to accelerate the learning of optimally cooperative behavior. Hence, cooperative AI might introduce new weaknesses not investigated in previous machine learning research. In this paper, our contributions include: (1) arguing that three algorithms inspired by human-like social intelligence introduce new vulnerabilities, unique to cooperative AI, that adversaries can exploit, and (2) an experiment showing that simple, adversarial perturbations on the agents' beliefs can negatively impact performance. This evidence points to the possibility that formal representations of social behavior are vulnerable to adversarial attacks.


On the Online Coalition Structure Generation Problem

Journal of Artificial Intelligence Research

We consider the online version of the coalition structure generation problem, in which agents, corresponding to the vertices of a graph, appear in an online fashion and have to be partitioned into coalitions by an authority (i.e., an online algorithm). When an agent appears, the algorithm has to decide whether to put the agent into an existing coalition or to create a new one containing, at this moment, only her. The decision is irrevocable. The objective is to partition agents into coalitions so as to maximize the resulting social welfare, which is the sum of all coalition values. We consider two cases for the value of a coalition: (1) the sum of the weights of its edges, and (2) the sum of the weights of its edges divided by its size. Coalition structures appear in a variety of applications in AI, multi-agent systems, networks, as well as in social networks, data analysis, computational biology, game theory, and scheduling. For each of the coalition value functions we consider the bounded and unbounded cases depending on whether or not the size of a coalition can exceed a given value α. Furthermore, we consider the case of a limited number of coalitions and various weight functions for the edges, i.e., unrestricted, positive and constant weights. We show tight or nearly tight bounds for the competitive ratio in each case.
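The two coalition value functions and the social welfare objective can be sketched in a few lines. This is only an illustration of the definitions, not of the online algorithms studied in the paper; the function names, the agent labels, and the edge weights are hypothetical.

```python
from itertools import combinations


def coalition_value(coalition, weights, normalize=False):
    """Value of one non-empty coalition.

    weights maps frozenset({u, v}) to the weight of edge (u, v);
    missing pairs count as weight 0 (hypothetical layout).
    """
    members = sorted(coalition)
    total = sum(
        weights.get(frozenset(pair), 0) for pair in combinations(members, 2)
    )
    # Case (2): edge-weight sum divided by coalition size.
    if normalize:
        return total / len(members)
    # Case (1): plain edge-weight sum.
    return total


def social_welfare(partition, weights, normalize=False):
    """Sum of coalition values over a partition of the agents."""
    return sum(coalition_value(c, weights, normalize) for c in partition)


weights = {
    frozenset({"a", "b"}): 4,
    frozenset({"b", "c"}): 2,
    frozenset({"a", "c"}): -1,
}
partition = [{"a", "b"}, {"c"}]

print(social_welfare(partition, weights))                  # 4
print(social_welfare(partition, weights, normalize=True))  # 2.0
print(social_welfare([{"a", "b", "c"}], weights))          # 5
```

In this toy instance the grand coalition is better under case (1), with welfare 5 against 4, while the split partition is better under case (2), with 2.0 against 5/3, so the choice of value function changes which partition is optimal.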