Agent Societies
Information State Embedding in Partially Observable Cooperative Multi-Agent Reinforcement Learning
Mao, Weichao, Zhang, Kaiqing, Miehling, Erik, Başar, Tamer
Multi-agent reinforcement learning (MARL) under partial observability has long been considered challenging, primarily due to the requirement for each agent to maintain a belief over all other agents' local histories -- a domain that generally grows exponentially over time. In this work, we investigate a partially observable MARL problem in which agents are cooperative. To enable the development of tractable algorithms, we introduce the concept of an information state embedding that serves to compress agents' histories. We quantify how the compression error influences the resulting value functions for decentralized control. Furthermore, we propose three natural embeddings, based on finite-memory truncation, principal component analysis, and recurrent neural networks. The output of these embeddings are then used as the information state, and can be fed into any MARL algorithm. The proposed embed-then-learn pipeline opens the black-box of existing MARL algorithms, allowing us to establish some theoretical guarantees (error bounds of value functions) while still achieving competitive performance with many end-to-end approaches.
Anytime and Efficient Coalition Formation with Spatial and Temporal Constraints
Capezzuto, Luca, Tarapore, Danesh, Ramchurn, Sarvapali D.
The Coalition Formation with Spatial and Temporal constraints Problem (CFSTP) is a multi-agent task allocation problem where the agents are cooperative and few, the tasks are many, spatially distributed, with deadlines and workloads, and the objective is to find a schedule that maximises the number of completed tasks. The current state-of-the-art CFSTP solver, the Coalition Formation with Look-Ahead (CFLA) algorithm, has two main limitations. First, its time complexity is quadratic with the number of tasks and exponential with the number of agents, which makes it not efficient. Second, its look-ahead technique is not effective in real-world scenarios, such as open multi-agent systems, where new tasks can appear at any time. Motivated by this, we propose an extension of CFLA, which we call Coalition Formation with Improved Look-Ahead (CFLA+). Since CFLA+ inherits the limitations of CFLA, we also develop a novel algorithm to solve the CFSTP, the first to be both anytime and efficient, which we call Cluster-based Coalition Formation (CCF). We empirically show that, in settings where the look-ahead technique is highly effective, CCF completes up to 20% (resp. 10%) more tasks than CFLA (resp. CFLA+) while being up to four orders of magnitude faster. Our results affirm CCF as the new state-of-the-art CFSTP solver.
Counterfactual Multi-Agent Reinforcement Learning with Graph Convolution Communication
Su, Jianyu, Adams, Stephen, Beling, Peter A.
We consider a fully cooperative multi-agent system where agents cooperate to maximize a system's utility in a partial-observable environment. We propose that multi-agent systems must have the ability to (1) communicate and understand the inter-plays between agents and (2) correctly distribute rewards based on an individual agent's contribution. In contrast, most work in this setting considers only one of the above abilities. In this study, we develop an architecture that allows for communication among agents and tailors the system's reward for each individual agent. Our architecture represents agent communication through graph convolution and applies an existing credit assignment structure, counterfactual multi-agent policy gradient (COMA), to assist agents to learn communication by back-propagation. The flexibility of the graph structure enables our method to be applicable to a variety of multi-agent systems, e.g. dynamic systems that consist of varying numbers of agents and static systems with a fixed number of agents. We evaluate our method on a range of tasks, demonstrating the advantage of marrying communication with credit assignment. In the experiments, our proposed method yields better performance than the state-of-art methods, including COMA. Moreover, we show that the communication strategies offers us insights and interpretability of the system's cooperative policies.
Mimicking Evolution with Reinforcement Learning
Abrantes, João P., Abrantes, Arnaldo J., Oliehoek, Frans A.
Evolution gave rise to human and animal intelligence here on Earth. We argue that the path to developing artificial human-like-intelligence will pass through mimicking the evolutionary process in a nature-like simulation. In Nature, there are two processes driving the development of the brain: evolution and learning. Evolution acts slowly, across generations, and amongst other things, it defines what agents learn by changing their internal reward function. Learning acts fast, across one's lifetime, and it quickly updates agents' policy to maximise pleasure and minimise pain. The reward function is slowly aligned with the fitness function by evolution, however, as agents evolve the environment and its fitness function also change, increasing the misalignment between reward and fitness. It is extremely computationally expensive to replicate these two processes in simulation. This work proposes Evolution via Evolutionary Reward (EvER) that allows learning to single-handedly drive the search for policies with increasingly evolutionary fitness by ensuring the alignment of the reward function with the fitness function. In this search, EvER makes use of the whole state-action trajectories that agents go through their lifetime. In contrast, current evolutionary algorithms discard this information and consequently limit their potential efficiency at tackling sequential decision problems. We test our algorithm in two simple bio-inspired environments and show its superiority at generating more capable agents at surviving and reproducing their genes when compared with a state-of-the-art evolutionary algorithm.
Researchers propose paradigm that trains AI agents through evolution
A paper published by researchers at Carnegie Mellon University, San Francisco research firm OpenAI, Facebook AI Research, the University of California at Berkeley, and Shanghai Jiao Tong University describes a paradigm that scales up multi-agent reinforcement learning, where AI models learn by having agents interact within an environment such that the agent population increases in size over time. By maintaining sets of agents in each training stage and performing mix-and-match and fine-tuning steps over these sets, the coauthors say the paradigm -- Evolutionary Population Curriculum -- is able to promote agents with the best adaptability to the next stage. In computer science, evolutionary computation is the family of algorithms for global optimization inspired by biological evolution. Instead of following explicit mathematical gradients, these models generate variants, test them, and retain the top performers. They've shown promise in early work by OpenAI, Google, Uber, and others, but they're somewhat tough to prototype because there's a dearth of tools targeting evolutionary algorithms and natural evolution strategies (NES).
Parallel Knowledge Transfer in Multi-Agent Reinforcement Learning
Multi-agent reinforcement learning is a standard framework for modeling multi-agent interactions applied in real-world scenarios. Inspired by experience sharing in human groups, learning knowledge parallel reusing between agents can potentially promote team learning performance, especially in multi-task environments. When all agents interact with the environment and learn simultaneously, how each independent agent selectively learns from other agents' behavior knowledge is a problem that we need to solve. This paper proposes a novel knowledge transfer framework in MARL, PAT (Parallel Attentional Transfer). We design two acting modes in PAT, student mode and self-learning mode. Each agent in our approach trains a decentralized student actor-critic to determine its acting mode at each time step. When agents are unfamiliar with the environment, the shared attention mechanism in student mode effectively selects learning knowledge from other agents to decide agents' actions. PAT outperforms state-of-the-art empirical evaluation results against the prior advising approaches. Our approach not only significantly improves team learning rate and global performance, but also is flexible and transferable to be applied in various multi-agent systems.
Multi-Agent Reinforcement Learning for Problems with Combined Individual and Team Reward
Sheikh, Hassam Ullah, Bölöni, Ladislau
Many cooperative multi-agent problems require agents to learn individual tasks while contributing to the collective success of the group. This is a challenging task for current state-of-the-art multi-agent reinforcement algorithms that are designed to either maximize the global reward of the team or the individual local rewards. The problem is exacerbated when either of the rewards is sparse leading to unstable learning. To address this problem, we present Decomposed Multi-Agent Deep Deterministic Policy Gradient (DE-MADDPG): a novel cooperative multi-agent reinforcement learning framework that simultaneously learns to maximize the global and local rewards. We evaluate our solution on the challenging defensive escort team problem and show that our solution achieves a significantly better and more stable performance than the direct adaptation of the MADDPG algorithm.
Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning
Long, Qian, Zhou, Zihan, Gupta, Abhibav, Fang, Fei, Wu, Yi, Wang, Xiaolong
In multi-agent games, the complexity of the environment can grow exponentially as the number of agents increases, so it is particularly challenging to learn good policies when the agent population is large. In this paper, we introduce Evolutionary Population Curriculum (EPC), a curriculum learning paradigm that scales up Multi-Agent Reinforcement Learning (MARL) by progressively increasing the population of training agents in a stage-wise manner. Furthermore, EPC uses an evolutionary approach to fix an objective misalignment issue throughout the curriculum: agents successfully trained in an early stage with a small population are not necessarily the best candidates for adapting to later stages with scaled populations. Concretely, EPC maintains multiple sets of agents in each stage, performs mix-and-match and fine-tuning over these sets and promotes the sets of agents with the best adaptability to the next stage. We implement EPC on a popular MARL algorithm, MADDPG, and empirically show that our approach consistently outperforms baselines by a large margin as the number of agents grows exponentially. The project page is https://sites.google.com/view/epciclr2020.
A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning
Carion, Nicolas, Usunier, Nicolas, Synnaeve, Gabriel, Lazaric, Alessandro
Effective coordination is crucial to solve multi-agent collaborative (MAC) problems. While centralized reinforcement learning methods can optimally solve small MAC instances, they do not scale to large problems and they fail to generalize to scenarios different from those seen during training. In this paper, we consider MAC problems with some intrinsic notion of locality (e.g., geographic proximity) such that interactions between agents and tasks are locally limited. By leveraging this property, we introduce a novel structured prediction approach to assign agents to tasks. At each step, the assignment is obtained by solving a centralized optimization problem (the inference procedure) whose objective function is parameterized by a learned scoring model.
Deep Multi-Agent Reinforcement Learning for Decentralized Continuous Cooperative Control
de Witt, Christian Schroeder, Peng, Bei, Kamienny, Pierre-Alexandre, Torr, Philip, Böhmer, Wendelin, Whiteson, Shimon
Deep multi-agent reinforcement learning (MARL) holds the promise of automating many real-world cooperative robotic manipulation and transportation tasks. Nevertheless, decentralised cooperative robotic control has received less attention from the deep reinforcement learning community, as compared to single-agent robotics and multi-agent games with discrete actions. To address this gap, this paper introduces Multi-Agent Mujoco, an easily extensible multi-agent benchmark suite for robotic control in continuous action spaces. The benchmark tasks are diverse and admit easily configurable partially observable settings. Inspired by the success of single-agent continuous value-based algorithms in robotic control, we also introduce COMIX, a novel extension to a common discrete action multi-agent $Q$-learning algorithm. We show that COMIX significantly outperforms state-of-the-art MADDPG on a partially observable variant of a popular particle environment and matches or surpasses it on Multi-Agent Mujoco. Thanks to this new benchmark suite and method, we can now pose an interesting question: what is the key to performance in such settings, the use of value-based methods instead of policy gradients, or the factorisation of the joint $Q$-function? To answer this question, we propose a second new method, FacMADDPG, which factors MADDPG's critic. Experimental results on Multi-Agent Mujoco suggest that factorisation is the key to performance.