Agent Societies
Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games
This paper addresses the problem of learning an equilibrium efficiently in general-sum Markov games through decentralized multi-agent reinforcement learning. Given the fundamental difficulty of calculating a Nash equilibrium (NE), we instead aim at finding a coarse correlated equilibrium (CCE), a solution concept that generalizes NE by allowing possible correlations among the agents' strategies. We propose an algorithm in which each agent independently runs optimistic V-learning (a variant of Q-learning) to efficiently explore the unknown environment, while using a stabilized online mirror descent (OMD) subroutine for policy updates. We show that the agents can find an $\epsilon$-approximate CCE in at most $\widetilde{O}( H^6S A /\epsilon^2)$ episodes, where $S$ is the number of states, $A$ is the size of the largest individual action space, and $H$ is the length of an episode. This appears to be the first sample complexity result for learning in generic general-sum Markov games. Our results rely on a novel investigation of an anytime high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which would be of independent interest. One key feature of our algorithm is that it is fully \emph{decentralized}, in the sense that each agent has access to only its local information, and is completely oblivious to the presence of others. This way, our algorithm can readily scale up to an arbitrary number of agents, without suffering from the exponential dependence on the number of agents.
State-based Episodic Memory for Multi-Agent Reinforcement Learning
Multi-agent reinforcement learning (MARL) algorithms have made promising progress in recent years by leveraging the centralized training and decentralized execution (CTDE) paradigm. However, existing MARL algorithms still suffer from the sample inefficiency problem. In this paper, we propose a simple yet effective approach, called state-based episodic memory (SEM), to improve sample efficiency in MARL. SEM adopts episodic memory (EM) to supervise the centralized training procedure of CTDE in MARL. To the best of our knowledge, SEM is the first work to introduce EM into MARL. We can theoretically prove that, when using for MARL, SEM has lower space complexity and time complexity than state and action based EM (SAEM), which is originally proposed for single-agent reinforcement learning. Experimental results on StarCraft multi-agent challenge (SMAC) show that introducing episodic memory into MARL can improve sample efficiency and SEM can reduce storage cost and time cost compared with SAEM.
Learning Cooperation and Online Planning Through Simulation and Graph Convolutional Network
Mahmud, Rafid Ameer, Faisal, Fahim, Mahmud, Saaduddin, Khan, Md. Mosaddek
Multi-agent Markov Decision Process (MMDP) has been an effective way of modelling sequential decision making algorithms for multi-agent cooperative environments. A number of algorithms based on centralized and decentralized planning have been developed in this domain. However, dynamically changing environment, coupled with exponential size of the state and joint action space, make it difficult for these algorithms to provide both efficiency and scalability. Recently, Centralized planning algorithm FV-MCTS-MP and decentralized planning algorithm \textit{Alternate maximization with Behavioural Cloning} (ABC) have achieved notable performance in solving MMDPs. However, they are not capable of adapting to dynamically changing environments and accounting for the lack of communication among agents, respectively. Against this background, we introduce a simulation based online planning algorithm, that we call SiCLOP, for multi-agent cooperative environments. Specifically, SiCLOP tailors Monte Carlo Tree Search (MCTS) and uses Coordination Graph (CG) and Graph Neural Network (GCN) to learn cooperation and provides real time solution of a MMDP problem. It also improves scalability through an effective pruning of action space. Additionally, unlike FV-MCTS-MP and ABC, SiCLOP supports transfer learning, which enables learned agents to operate in different environments. We also provide theoretical discussion about the convergence property of our algorithm within the context of multi-agent settings. Finally, our extensive empirical results show that SiCLOP significantly outperforms the state-of-the-art online planning algorithms.
Simulation of emergence in artificial societies: a practical model-based approach with the EB-DEVS formalism
Foguelman, Daniel, Lanzarotti, Esteban, Ferreyra, Emanuel, Castro, Rodrigo
Modelling and simulation of complex systems is key to exploring and understanding social processes, benefiting from formal mechanisms to derive global-level properties from local-level interactions. In this paper we extend the body of knowledge on formal methods in complex systems by applying EB-DEVS, a novel formalism tailored for the modelling, simulation and live identification of emergent properties. We guide the reader through the implementation of different classical models for varied social systems to introduce good modelling practices and showcase the advantages and limitations of modelling emergence with EB-DEVS, in particular through its live emergence detection capability. This work provides case study-driven evidence for the neatness and compactness of the approach to modelling communication structures that can be explicit or implicit, static or dynamic, with or without multilevel interactions, and with weak or strong emergent behaviour. Throughout examples we show that EB-DEVS permits conceptualising the analysed societies by incorporating emergent behaviour when required, namely by integrating as a macro-level aggregate the Gini index in the Sugarscape model, Fads and Fashion in the Dissemination of Culture model, size-biased degree distribution in a Preferential Attachment model, happiness index in the Segregation model and quarantines in the SIR epidemic model. In each example we discuss the role of communication structures in the development of multilevel simulation models, and illustrate how micro-macro feedback loops enable the modelling of macro-level properties. Our results stress the relevance of multilevel features to support a robust approach in the modelling and simulation of complex systems.
HAVEN: Hierarchical Cooperative Multi-Agent Reinforcement Learning with Dual Coordination Mechanism
Xu, Zhiwei, Bai, Yunpeng, Zhang, Bin, Li, Dapeng, Fan, Guoliang
Multi-agent reinforcement learning often suffers from the exponentially larger action space caused by a large number of agents. In this paper, we propose a novel value decomposition framework HAVEN based on hierarchical reinforcement learning for the fully cooperative multi-agent problems. In order to address instabilities that arise from the concurrent optimization of high-level and low-level policies and another concurrent optimization of agents, we introduce the dual coordination mechanism of inter-layer strategies and inter-agent strategies. HAVEN does not require domain knowledge and pretraining at all, and can be applied to any value decomposition variants. Our method is demonstrated to achieve superior results to many baselines on StarCraft II micromanagement tasks and offers an efficient solution to multi-agent hierarchical reinforcement learning in fully cooperative scenarios.
Modeling the interplay between epidemics and regional socio-economics
Snellman, Jan E., Barrio, Rafael A., Kaski, Kimmo K., Käpylä, Maarit J.
In this study we present a dynamical agent-based model to investigate the interplay between the socio-economy of and SEIRS-type epidemic spreading over a geographical area, divided to smaller area districts and further to smallest area cells. The model treats the populations of cells and authorities of districts as agents, such that the former can reduce their economic activity and the latter can recommend economic activity reduction both with the overall goal to slow down the epidemic spreading. The agents make decisions with the aim of attaining as high socio-economic standings as possible relative to other agents of the same type by evaluating their standings based on the local and regional infection rates, compliance to the authorities' regulations, regional drops in economic activity, and efforts to mitigate the spread of epidemic. We find that the willingness of population to comply with authorities' recommendations has the most drastic effect on the epidemic spreading: periodic waves spread almost unimpeded in non-compliant populations, while in compliant ones the spread is minimal with chaotic spreading pattern and significantly lower infection rates. Health and economic concerns of agents turn out to have lesser roles, the former increasing their efforts and the latter decreasing them.
Decentralized Cooperative Multi-Agent Reinforcement Learning with Exploration
Mao, Weichao, Başar, Tamer, Yang, Lin F., Zhang, Kaiqing
Many real-world applications of multi-agent reinforcement learning (RL), such as multi-robot navigation and decentralized control of cyber-physical systems, involve the cooperation of agents as a team with aligned objectives. We study multi-agent RL in the most basic cooperative setting -- Markov teams -- a class of Markov games where the cooperating agents share a common reward. We propose an algorithm in which each agent independently runs stage-based V-learning (a Q-learning style algorithm) to efficiently explore the unknown environment, while using a stochastic gradient descent (SGD) subroutine for policy updates. We show that the agents can learn an $\epsilon$-approximate Nash equilibrium policy in at most $\propto\widetilde{O}(1/\epsilon^4)$ episodes. Our results advocate the use of a novel \emph{stage-based} V-learning approach to create a stage-wise stationary environment. We also show that under certain smoothness assumptions of the team, our algorithm can achieve a nearly \emph{team-optimal} Nash equilibrium. Simulation results corroborate our theoretical findings. One key feature of our algorithm is being \emph{decentralized}, in the sense that each agent has access to only the state and its local actions, and is even \emph{oblivious} to the presence of the other agents. Neither communication among teammates nor coordination by a central controller is required during learning. Hence, our algorithm can readily generalize to an arbitrary number of agents, without suffering from the exponential dependence on the number of agents.
Efficient Multi-agent Epistemic Planning: Teaching Planners About Nested Belief
Muise, Christian, Belle, Vaishak, Felli, Paolo, McIlraith, Sheila, Miller, Tim, Pearce, Adrian R., Sonenberg, Liz
In the absence of prescribed coordination, it is often necessary for individual agents to synthesize their own plans, taking into account not only their own capabilities and beliefs about the world but also their beliefs about other agents, including what each of the agents will come to believe as the consequence of the actions of others. To illustrate, consider the scenario where Larry and Moe meet on a regular basis at the local diner to swap the latest gossip. Larry has come to know that Nancy (Larry's daughter) has just received a major promotion in her job, but unbeknownst to him, Moe has already learned this bit of information through the grapevine. Before they speak, both believe Nancy is getting a promotion, Larry believes Moe is unaware of this (and consequently wishes to share the news), and Moe assumes Larry must already be aware of the promotion but is unaware of Moe's own knowledge of the situation. Very quickly we can see how the nesting of (potentially incorrect) belief can be a complicated and interesting setting to model. In this paper, we examine the problem of synthesizing plans in such settings. In particular, given a finite set of agents, each with: (1) (possibly incomplete and incorrect) beliefs about the world and about the beliefs of other agents; and (2) differing capabilities including the ability to perform actions whose outcomes are unknown to other agents; we are interested in synthesizing a plan to achieve a goal condition. Planning is at the belief level and as such, while we consider the execution of actions that can change the state of the world (ontic actions) as well as an agent's state of knowledge or belief (epistemic or more accurately doxastic actions, including communication actions), all outcomes are with respect to belief.
Collective eXplainable AI: Explaining Cooperative Strategies and Agent Contribution in Multiagent Reinforcement Learning with Shapley Values
Heuillet, Alexandre, Couthouis, Fabien, Díaz-Rodríguez, Natalia
While Explainable Artificial Intelligence (XAI) is increasingly expanding more areas of application, little has been applied to make deep Reinforcement Learning (RL) more comprehensible. As RL becomes ubiquitous and used in critical and general public applications, it is essential to develop methods that make it better understood and more interpretable. This study proposes a novel approach to explain cooperative strategies in multiagent RL using Shapley values, a game theory concept used in XAI that successfully explains the rationale behind decisions taken by Machine Learning algorithms. Through testing common assumptions of this technique in two cooperation-centered socially challenging multi-agent environments environments, this article argues that Shapley values are a pertinent way to evaluate the contribution of players in a cooperative multi-agent RL context. To palliate the high overhead of this method, Shapley values are approximated using Monte Carlo sampling. Experimental results on Multiagent Particle and Sequential Social Dilemmas show that Shapley values succeed at estimating the contribution of each agent. These results could have implications that go beyond games in economics, (e.g., for non-discriminatory decision making, ethical and responsible AI-derived decisions or policy making under fairness constraints). They also expose how Shapley values only give general explanations about a model and cannot explain a single run, episode nor justify precise actions taken by agents. Future work should focus on addressing these critical aspects.
Divergence-Regularized Multi-Agent Actor-Critic
Entropy regularization is a popular method in reinforcement learning (RL). Although it has many advantages, it alters the RL objective and makes the converged policy deviate from the optimal policy of the original Markov Decision Process. Though divergence regularization has been proposed to settle this problem, it cannot be trivially applied to cooperative multi-agent reinforcement learning (MARL). In this paper, we investigate divergence regularization in cooperative MARL and propose a novel off-policy cooperative MARL framework, divergence-regularized multi-agent actor-critic (DMAC). Mathematically, we derive the update rule of DMAC which is naturally off-policy, guarantees a monotonic policy improvement and is not biased by the regularization. DMAC is a flexible framework and can be combined with many existing MARL algorithms. We evaluate DMAC in a didactic stochastic game and StarCraft Multi-Agent Challenge and empirically show that DMAC substantially improves the performance of existing MARL algorithms.