Agent Societies
A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning
Carion, Nicolas, Synnaeve, Gabriel, Lazaric, Alessandro, Usunier, Nicolas
Effective coordination is crucial to solve multi-agent collaborative (MAC) problems. While centralized reinforcement learning methods can optimally solve small MAC instances, they do not scale to large problems and they fail to generalize to scenarios different from those seen during training. In this paper, we consider MAC problems with some intrinsic notion of locality (e.g., geographic proximity) such that interactions between agents and tasks are locally limited. By leveraging this property, we introduce a novel structured prediction approach to assign agents to tasks. At each step, the assignment is obtained by solving a centralized optimization problem (the inference procedure) whose objective function is parameterized by a learned scoring model. We propose different combinations of inference procedures and scoring models able to represent coordination patterns of increasing complexity. The resulting assignment policy can be efficiently learned on small problem instances and readily reused in problems with more agents and tasks (i.e., zero-shot generalization). We report experimental results on a toy search and rescue problem and on several target selection scenarios in StarCraft: Brood War, in which our model significantly outperforms strong rule-based baselines on instances with 5 times more agents and tasks than those seen during training.
MAPEL: Multi-Agent Pursuer-Evader Learning using Situation Report
Verma, Sagar, Verma, Richa, Sujit, P. B.
P .B. Sujit IIIT Delhi sujit@iiitd.ac.in Abstract --In this paper, we consider a territory guarding game involving pursuers, evaders and a target in an environment that contains obstacles. The goal of the evaders is to capture the target, while that of the pursuers is to capture the evaders before they reach the target. All the agents have limited sensing range and can only detect each other when they are in their observation space. We focus on the challenge of effective cooperation between agents of a team. Finding exact solutions for such multi-agent systems is difficult because of the inherent complexity. We present Multi-Agent Pursuer-Evader Learning (MAPEL), a class of algorithms that use spatiotemporal graph representation to learn structured cooperation. The key concept is that the learning takes place in a decentralized manner and agents use situation report updates to learn about the whole environment from each others' partial observations. We use Recurrent Neural Networks (RNNs) to parameterize the spatiotemporal graph. An agent in MAPEL only updates all the other agents if an opponent or the target is inside its observation space by using situation report. We present a detailed analysis of how these two cooperation methods perform when the number of agents in the game are increased. We provide empirical results to show how agents cooperate under these two methods.
MAVEN: Multi-Agent Variational Exploration
Mahajan, Anuj, Rashid, Tabish, Samvelyan, Mikayel, Whiteson, Shimon
Centralised training with decentralised execution is an important setting for cooperative deep multi-agent reinforcement learning due to communication constraints during execution and computational tractability in training. In this paper, we analyse value-based methods that are known to have superior performance in complex environments [43]. We specifically focus on QMIX [40], the current state-of-the-art in this domain. We show that the representational constraints on the joint action-values introduced by QMIX and similar methods lead to provably poor exploration and suboptimality. Furthermore, we propose a novel approach called MAVEN that hybridises value and policy-based methods by introducing a latent space for hierarchical control. The value-based agents condition their behaviour on the shared latent variable controlled by a hierarchical policy. This allows MAVEN to achieve committed, temporally extended exploration, which is key to solving complex multi-agent tasks. Our experimental results show that MAVEN achieves significant performance improvements on the challenging SMAC domain [43].
Multiagent Rollout Algorithms and Reinforcement Learning
We consider finite and infinite horizon dynamic programming problems, where the control at each stage consists of several distinct decisions, each one made by one of several agents. We introduce an algorithm, whereby at every stage, each agent's decision is made by executing a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of local computation required at every stage by each agent is independent of the number of agents, while the amount of global computation (over all agents) grows linearly with the number of agents. By contrast, with the standard rollout algorithm, the amount of global computation grows exponentially with the number of agents. Despite the drastic reduction in required computation, we show that our algorithm has the fundamental cost improvement property of rollout: an improved performance relative to the base policy. We also explore related reinforcement learning and approximate policy iteration algorithms, and we discuss how this cost improvement property is affected when we attempt to improve further the method's computational efficiency through parallelization of the agents' computations.
Visual Hide and Seek
Chen, Boyuan, Song, Shuran, Lipson, Hod, Vondrick, Carl
V ISUAL HIDE AND SEEK Boyuan Chen Columbia University Shuran Song Columbia University Hod Lipson Columbia University Carl V ondrick Columbia University A BSTRACT We train embodied agents to play Visual Hide and Seek where a prey must navigate in a simulated environment in order to avoid capture from a predator. We place a variety of obstacles in the environment for the prey to hide behind, and we only give the agents partial observations of their environment using an egocentric perspective. Although we train the model to play this game from scratch, experiments and visualizations suggest that the agent learns to predict its own visibility in the environment. Furthermore, we quantitatively analyze how agent weaknesses, such as slower speed, effect the learned policy. Our results suggest that, although agent weaknesses make the learning problem more challenging, they also cause more useful features to be learned. Our project website is available at: http://www.cs.columbia.edu/ We designed this game to mimic the typical dynamics between predator and prey. For example, we place a variety of obstacles inside the environment, which create occlusions that the agent can leverage to hide behind. We also only give the agents access to the first-person perspective of their three-dimensional environment. Consequently, this task is a substantial challenge for reinforcement learning because the state is both visual (pixel input) and partially observable (due to occlusions).
On the Utility of Learning about Humans for Human-AI Coordination
Carroll, Micah, Shah, Rohin, Ho, Mark K., Griffiths, Thomas L., Seshia, Sanjit A., Abbeel, Pieter, Dragan, Anca
While we would like agents that can coordinate with humans, current algorithms such as self-play and population-based training create agents that can coordinate with themselves. Agents that assume their partner to be optimal or similar to them can converge to coordination protocols that fail to understand and be understood by humans. To demonstrate this, we introduce a simple environment that requires challenging coordination, based on the popular game Overcooked, and learn a simple model that mimics human play. We evaluate the performance of agents trained via self-play and population-based training. These agents perform very well when paired with themselves, but when paired with our human model, they are significantly worse than agents designed to play with the human model. An experiment with a planning algorithm yields the same conclusion, though only when the human-aware planner is given the exact human model that it is playing with. A user study with real humans shows this pattern as well, though less strongly. Qualitatively, we find that the gains come from having the agent adapt to the human's gameplay. Given this result, we suggest several approaches for designing agents that learn about humans in order to better coordinate with them. Code is available at https://github.com/HumanCompatibleAI/overcooked_ai.
Influence-Based Multi-Agent Exploration
Wang, Tonghan, Wang, Jianhao, Wu, Yi, Zhang, Chongjie
A BSTRACT Intrinsically motivated reinforcement learning aims to address the exploration challenge for sparse-reward tasks. However, the study of exploration methods in transition-dependent multi-agent settings is largely absent from the literature. We aim to take a step towards solving this problem. We present two exploration methods: exploration via information-theoretic influence (EITI) and exploration via decision-theoretic influence (EDTI), by exploiting the role of interaction in coordinated behaviors of agents. EITI uses mutual information to capture influence transition dynamics. EDTI uses a novel intrinsic reward, called V alue of Interaction (V oI), to characterize and quantify the influence of one agent's behavior on expected returns of other agents. By optimizing EITI or EDTI objective as a regularizer, agents are encouraged to coordinate their exploration and learn policies to optimize team performance. We show how to optimize these regularizers so that they can be easily integrated with policy gradient reinforcement learning. The resulting update rule draws a connection between coordinated exploration and intrinsic reward distribution. Finally, we empirically demonstrate the significant strength of our method in a variety of multi-agent scenarios. Many advances of deep reinforcement learning rely on a dense shaped reward function, such as distance to the goal (Mirowski et al., 2016; Wu et al., 2018), scores in games (Mnih et al., 2015) or expert-designed rewards (Wu & Tian, 2016; OpenAI, 2018), while tend to struggle in many real-world scenarios with sparse rewards.
Learning Nearly Decomposable Value Functions Via Communication Minimization
Wang, Tonghan, Wang, Jianhao, Zheng, Chongyi, Zhang, Chongjie
Reinforcement learning encounters major challenges in multi-agent settings, such as scalability and non-stationarity. Recently, value function factorization learning emerges as a promising way to address these challenges in collaborative multi-agent systems. However, existing methods have been focusing on learning fully decentralized value function, which are not efficient for tasks requiring communication. To address this limitation, this paper presents a novel framework for learning nearly decomposable value functions with communication, with which agents act on their own most of the time but occasionally send messages to other agents in order for effective coordination. This framework hybridizes value function factorization learning and communication learning by introducing two information-theoretic regularizers. These regularizers are maximizing mutual information between decentralized Q functions and communication messages while minimizing the entropy of messages between agents. We show how to optimize these regularizers in a way that is easily integrated with existing value function factorization methods such as QMIX. Finally, we demonstrate that, on the StarCraft unit micromanagement benchmark, our framework significantly outperforms baseline methods and allows to cut off more than $80\%$ communication without sacrificing the performance. The video of our experiments is available at https://sites.google.com/view/ndvf.
Using Neural Networks for Programming by Demonstration
Budhraja, Karan K., Gao, Hang, Oates, Tim
Agent-based modeling is a paradigm of modeling dynamic systems of interacting agents that are individually governed by specified behavioral rules. Training a model of such agents to produce an emergent behavior by specification of the emergent (as opposed to agent) behavior is easier from a demonstration perspective. Without the involvement of manual behavior specification via code or reliance on a defined taxonomy of possible behaviors, the demonstrator specifies the desired emergent behavior of the system over time, and retrieves agent-level parameters required to execute that motion. A low time-complexity and data requirement favoring framework for reproducing emergent behavior, given an abstract demonstration, is discussed in [1], [2]. The existing framework does, however, observe an inherent limitation in scalability because of an exponentially growing search space (with the number of agent-level parameters). Our work addresses this limitation by pursuing a more scalable architecture with the use of neural networks. While the (proof-of-concept) architecture is not suitable for many evaluated domains because of its lack of representational capacity for that domain, it is more suitable than existing work for larger datasets for the Civil Violence agent-based model.
Social Learning in Multi Agent Multi Armed Bandits
Sankararaman, Abishek, Ganesh, Ayalvadi, Shakkottai, Sanjay
In this paper, we introduce a distributed version of the classical stochastic Multi-Arm Bandit (MAB) problem. Our setting consists of a large number of agents $n$ that collaboratively and simultaneously solve the same instance of $K$ armed MAB to minimize the average cumulative regret over all agents. The agents can communicate and collaborate among each other \emph{only} through a pairwise asynchronous gossip based protocol that exchange a limited number of bits. In our model, agents at each point decide on (i) which arm to play, (ii) whether to, and if so (iii) what and whom to communicate with. Agents in our model are decentralized, namely their actions only depend on their observed history in the past. We develop a novel algorithm in which agents, whenever they choose, communicate only arm-ids and not samples, with another agent chosen uniformly and independently at random. The per-agent regret scaling achieved by our algorithm is $O \left( \frac{\lceil\frac{K}{n}\rceil+\log(n)}{\Delta} \log(T) + \frac{\log^3(n) \log \log(n)}{\Delta^2} \right)$. Furthermore, any agent in our algorithm communicates only a total of $\Theta(\log(T))$ times over a time interval of $T$. We compare our results to two benchmarks - one where there is no communication among agents and one corresponding to complete interaction. We show both theoretically and empirically, that our algorithm experiences a significant reduction both in per-agent regret when compared to the case when agents do not collaborate and in communication complexity when compared to the full interaction setting which requires $T$ communication attempts by an agent over $T$ arm pulls.