Agent Societies
Learning to Schedule Communication in Multi-agent Reinforcement Learning
Kim, Daewoo, Moon, Sangwoo, Hostallero, David, Kang, Wan Ju, Lee, Taeyoung, Son, Kyunghwan, Yi, Yung
Many real-world reinforcement learning tasks require multiple agents to make sequential decisions while interacting with one another, and well-coordinated actions among the agents are crucial to achieving the target goal. One way to accelerate the coordination effect is to enable multiple agents to communicate with each other in a distributed manner and behave as a group. In this paper, we study a practical scenario in which (i) the communication bandwidth is limited and (ii) the agents share the communication medium so that only a restricted number of agents are able to simultaneously use the medium, as in the state-of-the-art wireless networking standards. This calls for a certain form of communication scheduling. In that regard, we propose a multi-agent deep reinforcement learning framework, called SchedNet, in which agents learn how to schedule themselves, how to encode the messages, and how to select actions based on received messages. SchedNet is capable of deciding which agents should be entitled to broadcast their (encoded) messages, by learning the importance of each agent's partially observed information. We evaluate SchedNet against multiple baselines under two different applications, namely, cooperative communication and navigation, and predator-prey. Our experiments show a non-negligible performance gap between SchedNet and other mechanisms such as the ones without communication and with vanilla scheduling methods, e.g., round robin, ranging from 32% to 43%.
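The contrast between importance-based scheduling and round robin can be illustrated with a minimal sketch. It assumes each agent already has a scalar importance weight; in SchedNet these weights are learned, whereas here they are plain numbers supplied by hand. The function names `schedule_top_k` and `schedule_round_robin` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def schedule_top_k(weights: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k agents with the largest weights.

    Ties are broken in favor of the lower agent index.
    """
    if not 0 <= k <= len(weights):
        raise ValueError("k must be between 0 and the number of agents")
    ranked = sorted(range(len(weights)), key=lambda i: (-weights[i], i))
    return sorted(ranked[:k])


def schedule_round_robin(n_agents: int, k: int, step: int) -> List[int]:
    """Return the k agents whose turn it is at the given time step."""
    if not 0 <= k <= n_agents:
        raise ValueError("k must be between 0 and the number of agents")
    start = (step * k) % n_agents
    return sorted((start + j) % n_agents for j in range(k))


# Hand-picked weights standing in for learned importance values (assumption).
weights = [0.1, 0.9, 0.4, 0.7]
print(schedule_top_k(weights, k=2))
print(schedule_round_robin(n_agents=4, k=2, step=1))
```

Round robin ignores what the agents observe, while the top-k rule gives the shared medium to whichever agents currently carry the highest weights.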
Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning
Qu, Chao, Mannor, Shie, Xu, Huan, Qi, Yuan, Song, Le, Xiong, Junwu
We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. This problem is widely encountered in many areas including traffic control, distributed control, and smart grids. We assume that the reward function for each agent can be different and observed only locally by the agent itself. Furthermore, each agent is located at a node of a communication network and can exchange information only with its neighbors. Using softmax temporal consistency and a decentralized optimization method, we obtain a principled and data-efficient iterative algorithm. In the first step of each iteration, an agent computes its local policy and value gradients and then updates only policy parameters. In the second step, the agent propagates to its neighbors the messages based on its value function and then updates its own value function. Hence we name the algorithm value propagation. We prove a non-asymptotic convergence rate 1/T with the nonlinear function approximation. To the best of our knowledge, it is the first MARL algorithm with a convergence guarantee in the control, off-policy and non-linear function approximation setting. We empirically demonstrate the effectiveness of our approach in experiments.
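The constraint that an agent exchanges information only with its neighbors can be illustrated with a generic consensus-averaging round. This is not the value propagation update from the paper; it is a standard textbook step, shown only to make the neighbor-only communication pattern concrete. The graph, the scalar values, and the name `consensus_step` are assumptions made for illustration.

```python
from typing import Dict, List


def consensus_step(values: List[float], neighbors: Dict[int, List[int]]) -> List[float]:
    """One synchronous round: each agent averages its value with its neighbors' values."""
    updated = []
    for agent, own in enumerate(values):
        # An agent only reads values held by agents adjacent to it in the graph.
        received = [values[j] for j in neighbors.get(agent, [])]
        updated.append((own + sum(received)) / (1 + len(received)))
    return updated


# Hypothetical three-agent line graph: 0 - 1 - 2.
line_graph = {0: [1], 1: [0, 2], 2: [1]}
print(consensus_step([0.0, 3.0, 6.0], line_graph))  # [1.5, 3.0, 4.5]
```

Agents 0 and 2 never communicate directly, yet repeated rounds let information from one reach the other through agent 1.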
Distributed Policy Iteration for Scalable Approximation of Cooperative Multi-Agent Policies
Phan, Thomy, Schmid, Kyrill, Belzner, Lenz, Gabor, Thomas, Feld, Sebastian, Linnhoff-Popien, Claudia
Decision making in multi-agent systems (MAS) is a great challenge due to enormous state and joint action spaces as well as uncertainty, making centralized control generally infeasible. Decentralized control offers better scalability and robustness but requires mechanisms to coordinate on joint tasks and to avoid conflicts. Common approaches to learn decentralized policies for cooperative MAS suffer from non-stationarity and lacking credit assignment, which can lead to unstable and uncoordinated behavior in complex environments. In this paper, we propose Strong Emergent Policy approximation (STEP), a scalable approach to learn strong decentralized policies for cooperative MAS with a distributed variant of policy iteration. For that, we use function approximation to learn from action recommendations of a decentralized multi-agent planning algorithm. STEP combines decentralized multi-agent planning with centralized learning, only requiring a generative model for distributed black box optimization. We experimentally evaluate STEP in two challenging and stochastic domains with large state and joint action spaces and show that STEP is able to learn stronger policies than standard multi-agent reinforcement learning algorithms, when combining multi-agent open-loop planning with centralized function approximation. The learned policies can be reintegrated into the multi-agent planning process to further improve performance.
Feudal Multi-Agent Hierarchies for Cooperative Reinforcement Learning
Ahilan, Sanjeevan, Dayan, Peter
We investigate how reinforcement learning agents can learn to cooperate. Drawing inspiration from human societies, in which successful coordination of many individuals is often facilitated by hierarchical organisation, we introduce Feudal Multi-agent Hierarchies (FMH). In this framework, a 'manager' agent, which is tasked with maximising the environmentally-determined reward function, learns to communicate subgoals to multiple, simultaneously-operating, 'worker' agents. Workers, which are rewarded for achieving managerial subgoals, take concurrent actions in the world. We outline the structure of FMH and demonstrate its potential for decentralised learning and control. We find that, given an adequate set of subgoals from which to choose, FMH performs, and particularly scales, substantially better than cooperative approaches that use a shared reward function.
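The split of rewards between manager and workers can be sketched as follows. The sketch assumes subgoals are target positions in the plane and that a worker's reward is the negative Euclidean distance to its assigned subgoal; both choices, and the function names, are illustrative assumptions rather than details taken from the paper.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def worker_reward(position: Point, subgoal: Point) -> float:
    """Reward a worker for being close to the subgoal its manager assigned (assumed form)."""
    return -math.dist(position, subgoal)


def feudal_rewards(env_reward: float,
                   worker_positions: List[Point],
                   subgoals: List[Point]) -> Dict[str, object]:
    """Manager receives the environment reward; each worker is paid per its own subgoal."""
    if len(worker_positions) != len(subgoals):
        raise ValueError("each worker needs exactly one subgoal")
    return {
        "manager": env_reward,
        "workers": [worker_reward(p, g) for p, g in zip(worker_positions, subgoals)],
    }


print(feudal_rewards(1.0, [(0.0, 0.0), (2.0, 2.0)], [(3.0, 4.0), (2.0, 2.0)]))
```

Only the manager sees the environment reward here, so each worker's learning signal depends solely on its own subgoal.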
TSA Says the Number of Agents Skipping Work Has Spiked Due to the Shutdown
New figures released Sunday reveal a record number of agents are not showing up to work. The Transportation Security Administration has reported that the number of airport security agents not showing up to work reached an all-time high over the holiday weekend, according to the Washington Post, a side-effect of the government shutdown that the Department of Homeland Security previously stated was not a concern. TSA agents are among the estimated 800,000 federal workers who are furloughed or working without pay during a government shutdown that is reaching its 30th day. The Washington Post reported that the number of unscheduled absences hit 8 percent nationally this weekend, up from 3 percent a year ago.
Theory of Minds: Understanding Behavior in Groups Through Inverse Planning
Shum, Michael, Kleiman-Weiner, Max, Littman, Michael L., Tenenbaum, Joshua B.
Human social behavior is structured by relationships. We form teams, groups, tribes, and alliances at all scales of human life. These structures guide multi-agent cooperation and competition, but when we observe others these underlying relationships are typically unobservable and hence must be inferred. Humans make these inferences intuitively and flexibly, often making rapid generalizations about the latent relationships that underlie behavior from just sparse and noisy observations. Rapid and accurate inferences are important for determining who to cooperate with, who to compete with, and how to cooperate in order to compete. Towards the goal of building machine-learning algorithms with human-like social intelligence, we develop a generative model of multi-agent action understanding based on a novel representation for these latent relationships called Composable Team Hierarchies (CTH). This representation is grounded in the formalism of stochastic games and multi-agent reinforcement learning. We use CTH as a target for Bayesian inference yielding a new algorithm for understanding behavior in groups that can both infer hidden relationships as well as predict future actions for multiple agents interacting together. Our algorithm rapidly recovers an underlying causal model of how agents relate in spatial stochastic games from just a few observations. The patterns of inference made by this algorithm closely correspond with human judgments and the algorithm makes the same rapid generalizations that people do.
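The idea of inferring hidden relationships from observed behavior can be illustrated with a plain Bayesian update over candidate team structures. This is a generic application of Bayes' rule, not the CTH inference algorithm; the two hypotheses, the prior, and the likelihood values are made-up numbers for illustration.

```python
from typing import Dict


def posterior(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule: posterior is proportional to prior times likelihood of the observation."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total <= 0:
        raise ValueError("the observation has zero probability under every hypothesis")
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical hypotheses about how two agents relate (assumption).
prior = {"teammates": 0.5, "competitors": 0.5}
# Assumed probability of the observed joint action under each hypothesis.
likelihood = {"teammates": 0.8, "competitors": 0.2}
print(posterior(prior, likelihood))
```

Applying the update once per observation, with the previous posterior as the new prior, concentrates belief on the structure that best explains the agents' actions.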
Improving Coordination in Multi-Agent Deep Reinforcement Learning through Memory-driven Communication
Pesce, Emanuele, Montana, Giovanni
Deep reinforcement learning algorithms have recently been used to train multiple interacting agents in a centralised manner whilst keeping their execution decentralised. When the agents can only acquire partial observations and are faced with a task requiring coordination and synchronisation skills, inter-agent communication plays an essential role. In this work, we propose a framework for multi-agent training using deep deterministic policy gradients that enables the concurrent, end-to-end learning of an explicit communication protocol through a memory device. During training, the agents learn to perform read and write operations enabling them to infer a shared representation of the world. We empirically demonstrate that concurrent learning of the communication device and individual policies can improve inter-agent coordination and performance, and illustrate how different communication patterns can emerge for different tasks.
Transparent Machine Education of Neural Networks for Swarm Shepherding Using Curriculum Design
Gee, Alexander, Abbass, Hussein
Swarm control is a difficult problem due to the need to guide a large number of agents simultaneously. We cast the problem as a shepherding problem, similar to biological dogs guiding a group of sheep towards a goal. The shepherd needs to deal with complex and dynamic environments and make decisions in order to direct the swarm from one location to another. In this paper, we design a novel curriculum to teach an artificial intelligence empowered agent to shepherd in the presence of the large state space associated with the shepherding problem and in a transparent manner. The results show that a properly designed curriculum could indeed enhance the speed of learning and the complexity of learnt behaviours.
Global collaboration needed for future space missions
Japan is launching multiple missions to explore the mysteries of the solar system in the coming years, joining hands with the European Union and countries such as India to compete with space superpowers such as the United States and Russia. The ultimate goal of space exploration is "to expand the areas of activities for humans and find another habitable planet. I believe there is a possibility that we can colonize Mars," said Hitoshi Kuninaka, a vice president of the Japan Aerospace Exploration Agency (JAXA). In 2018, Japan made history by landing two small rovers from the space probe Hayabusa2 on the surface of an asteroid 300 million kilometers from Earth. Hayabusa2's touchdown on the Ryugu asteroid is expected in late January this year.
Inequity aversion improves cooperation in intertemporal social dilemmas
Hughes, Edward, Leibo, Joel Z., Phillips, Matthew, Tuyls, Karl, Dueñez-Guzman, Edgar, Castañeda, Antonio García, Dunning, Iain, Zhu, Tina, McKee, Kevin, Koster, Raphael, Roff, Heather, Graepel, Thore
Groups of humans are often able to find ways to cooperate with one another in complex, temporally extended social dilemmas. Models based on behavioral economics are only able to explain this phenomenon for unrealistic stateless matrix games. Recently, multi-agent reinforcement learning has been applied to generalize social dilemma problems to temporally and spatially extended Markov games. However, this has not yet generated an agent that learns to cooperate in social dilemmas as humans do. A key insight is that many, but not all, human individuals have inequity averse social preferences. This promotes a particular resolution of the matrix game social dilemma wherein inequity-averse individuals are personally pro-social and punish defectors. Here we extend this idea to Markov games and show that it promotes cooperation in several types of sequential social dilemma, via a profitable interaction with policy learnability. In particular, we find that inequity aversion improves temporal credit assignment for the important class of intertemporal social dilemmas. These results help explain how large-scale cooperation may emerge and persist.
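An inequity-averse preference can be written down using the classic Fehr-Schmidt utility from behavioral economics, sketched below for a single time step. The paper extends the idea to Markov games; this sketch uses instantaneous rewards only, which is a simplification, and the parameter values in the example are arbitrary.

```python
from typing import Sequence


def inequity_averse_utility(rewards: Sequence[float], i: int,
                            alpha: float, beta: float) -> float:
    """Fehr-Schmidt utility of agent i given every agent's reward.

    alpha penalizes disadvantageous inequity (others earn more than i);
    beta penalizes advantageous inequity (i earns more than others).
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("inequity is only defined for two or more agents")
    own = rewards[i]
    others = [r for j, r in enumerate(rewards) if j != i]
    disadvantage = sum(max(r - own, 0.0) for r in others)
    advantage = sum(max(own - r, 0.0) for r in others)
    return own - alpha * disadvantage / (n - 1) - beta * advantage / (n - 1)


# Arbitrary example values (assumption): three agents, agent 1 in the middle.
print(inequity_averse_utility([1.0, 3.0, 5.0], i=1, alpha=1.0, beta=0.5))  # 1.5
```

With both penalties active, an agent's utility falls whenever rewards are unequal in either direction, which is what discourages free-riding and over-harvesting in the matrix-game setting.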