Agent Societies
Attention-based Fault-tolerant Approach for Multi-agent Reinforcement Learning Systems
Geng, Mingyang, Xu, Kele, Li, Yiying, Liu, Shuqi, Ding, Bo, Wang, Huaimin
The aim of multi-agent reinforcement learning systems is to provide interacting agents with the ability to collaboratively learn and adapt to the behavior of other agents. In many real-world applications, the agents can only acquire a partial view of the world. However, in realistic settings, one or more agents that show arbitrarily faulty or malicious behavior may suffice to let the current coordination mechanisms fail. In this paper, we study a practical scenario considering the security issues in the presence of agents with arbitrarily faulty or malicious behavior. Under these circumstances, learning an optimal policy becomes particularly challenging, even in the unrealistic case that an agent's policy can be made conditional upon all other agents' observations. To overcome these difficulties, we present an Attention-based Fault-Tolerant (FT-Attn) algorithm which selects correct and relevant information for each agent at every time-step. The multi-head attention mechanism enables the agents to learn effective communication policies through experience concurrently to the action policies. Empirical results have shown that FT-Attn beats previous state-of-the-art methods in some complex environments and can adapt to various kinds of noisy environments without tuning the complexity of the algorithm. Furthermore, FT-Attn can effectively deal with the complex situation where an agent needs to reach multiple agents' correct observation at the same time.
RLCache: Automated Cache Management Using Reinforcement Learning
This study investigates the use of reinforcement learning to guide a general purpose cache manager decisions. Cache managers directly impact the overall performance of computer systems. They govern decisions about which objects should be cached, the duration they should be cached for, and decides on which objects to evict from the cache if it is full. These three decisions impact both the cache hit rate and size of the storage that is needed to achieve that cache hit rate. An optimal cache manager will avoid unnecessary operations, maximise the cache hit rate which results in fewer round trips to a slower backend storage system, and minimise the size of storage needed to achieve a high hit-rate. This project investigates using reinforcement learning in cache management by designing three separate agents for each of the cache manager tasks. Furthermore, the project investigates two advanced reinforcement learning architectures for multi-decision problems: a single multi-task agent and a multi-agent. We also introduce a framework to simplify the modelling of computer systems problems as a reinforcement learning task. The framework abstracts delayed experiences observations and reward assignment in computer systems while providing a flexible way to scale to multiple agents. Simulation results based on an established database benchmark system show that reinforcement learning agents can achieve a higher cache hit rate over heuristic driven algorithms while minimising the needed space. They are also able to adapt to a changing workload and dynamically adjust their caching strategy accordingly. The proposed cache manager model is generic and applicable to other types of caches, such as file system caches. This project is the first, to our knowledge, to model cache manager decisions as a multi-task control problem.
Optimal Algorithms for Submodular Maximization with Distributed Constraints
Robey, Alexander, Adibi, Arman, Schlotfeldt, Brent, Pappas, George J., Hassani, Hamed
Optimal Algorithms for Submodular Maximization with Distributed Constraints Alexander Robey, Arman Adibi, Brent Schlotfeldt, George J. Pappas, and Hamed Hassani Abstract -- We consider a class of discrete optimization problems that aim to maximize a submodular objective function subject to a distributed partition matroid constraint. More precisely, we consider a networked scenario in which multiple agents choose actions from local strategy sets with the goal of maximizing a submodular objective function defined over the set of all possible actions. Given this distributed setting, we develop Constraint-Distributed Continuous Greedy ( CDCG), a message passing algorithm that converges to the tight (1 1 /e) approximation factor of the optimum global solution using only local computation and communication. It is known that a sequential greedy algorithm can only achieve a 1 /2 multiplicative approximation of the optimal solution for this class of problems in the distributed setting. Our framework relies on lifting the discrete problem to a continuous domain and developing a consensus algorithm that achieves the tight (1 1 /e) approximation guarantee of the global discrete solution once a proper rounding scheme is applied. We also offer empirical results from a multi-agent area coverage problem to show that the proposed method significantly outperforms the state-of-the-art sequential greedy method. I. INTRODUCTION Recently, the need has arisen to design algorithms that distribute decision making among a collection of agents or computing devices. This need has been motivated by problems from statistics, machine learning and robotics. These problems include: - (Density estimation) What is the best way to estimate a nonparametric density function from a distributed dataset? Inherent to all of these applications is an underlying optimization problem that can be expressed as maximize f (S) (1a) subject to S Y, S I (1b) where f is a submodular function (i.e. it has a diminishing-returns property), Y is a finite ground set of all decision variables, and I is a family of allowable subsets of Y .
Deep Coordination Graphs
Bรถhmer, Wendelin, Kurin, Vitaly, Whiteson, Shimon
This paper introduces the deep coordination graph (DCG) for collaborative multi-agent reinforcement learning. DCG strikes a flexible trade-off between representational capacity and generalization by factorizing the joint value function of all agents according to a coordination graph into payoffs between pairs of agents. The value can be maximized by local message passing along the graph, which allows training of the value function end-to-end with Q-learning. Payoff functions are approximated with deep neural networks and parameter sharing improves generalization over the state-action space. We show that DCG can solve challenging predator-prey tasks that are vulnerable to the relative overgeneralization pathology and in which all other known value factorization approaches fail.
Interaction-Aware Multi-Agent Reinforcement Learning for Mobile Agents with Individual Goals
Mohseni-Kabir, Anahita, Isele, David, Fujimura, Kikuo
-- In a multi-agent setting, the optimal policy of a single agent is largely dependent on the behavior of other agents. We investigate the problem of multi-agent reinforcement learning, focusing on decentralized learning in non-stationary domains for mobile robot navigation. We identify a cause for the difficulty in training non-stationary policies: mutual adaptation to sub-optimal behaviors, and we use this to motivate a curriculum-based strategy for learning interactive policies. The curriculum has two stages. First, the agent leverages policy gradient algorithms to learn a policy that is capable of achieving multiple goals. Second, the agent learns a modifier policy to learn how to interact with other agents in a multi-agent setting. We evaluated our approach on both an autonomous driving lane-change domain and a robot navigation domain. Single agent reinforcement learning (RL) algorithms have made significant progress in game playing [20] and robotics [13], however, single agent learning algorithms in multi-agent settings are prone to learn stereotyped behaviors that over-fit to the training environment [22], [15]. There are several reasons why multi-agent environments are more difficult: 1) interacting with an unknown agent requires having either multiple responses to a given situation or a more nuanced ability to perceive differences. The former breaks the Markov assumption, the latter rules out simpler solutions which are likely to be found first.
Integrating independent and centralized multi-agent reinforcement learning for traffic signal network optimization
Zhang, Zhi, Yang, Jiachen, Zha, Hongyuan
Traffic congestion in metropolitan areas is a world-wide problem that can be ameliorated by traffic lights that respond dynamically to real-time conditions. Recent studies applying deep reinforcement learning (RL) to optimize single traffic lights have shown significant improvement over conventional control. However, optimization of global traffic condition over a large road network fundamentally is a cooperative multi-agent control problem, for which single-agent RL is not suitable due to environment non-stationarity and infeasibility of optimizing over an exponential joint-action space. Motivated by these challenges, we propose QCOMBO, a simple yet effective multi-agent reinforcement learning (MARL) algorithm that combines the advantages of independent and centralized learning. We ensure scalability by selecting actions from individually optimized utility functions, which are shaped to maximize global performance via a novel consistency regularization loss between individual utility and a global action-value function. Experiments on diverse road topologies and traffic flow conditions in the SUMO traffic simulator show competitive performance of QCOMBO versus recent state-of-the-art MARL algorithms. We further show that policies trained on small sub-networks can effectively generalize to larger networks under different traffic flow conditions, providing empirical evidence for the suitability of MARL for intelligent traffic control.
Multi-Robot Deep Reinforcement Learning with Macro-Actions
Xiao, Yuchen, Hoffman, Joshua, Xia, Tian, Amato, Christopher
A. MacDec-POMDPs Decentralized fully collaborative multi-agent decision-making under uncertainty can be modeled as a decentralized POMDP (Dec-POMDP) [14]. Due to the assumption of synchronous actions that require the same amount of time for each agent, Dec-POMDPs are not applicable to multi-robot planning and learning scenarios in real-world. MacDec-POMDPs, formalized by introducing macro-actions into Dec-POMDPs, inherently allow asynchronous execution among robots with temporally extended macro-actions that can begin and end at different times for each agent. Formally, a MacDec-POMDP is defined as a tuple nullI,S,A, โฆ,M,ฮถ,O,T,Z,R null, where I is a finite set of agents; S is a finite set of environment states; A iA i and โฆ iโฆ i are the spaces of joint-primitive-action and joint-primitive-observation respectively; M iM i is the joint set of each agent's finite macro-action space M i; ฮถ iฮถ i is the set of joint macro-observations over agents' finite macro-observation space ฮถ i. Given a macro-action- based policy, each agent i is allowed to asynchronously choose a macro-action m i nullฮฒ m,I m,ฯ m null i that depends on individual macro-action-observation histories, where ฮฒ m: H A i [0, 1] is the stochastic termination condition and I m H M i is the initiation set of the corresponding macro-action m i, respectively depending on the primitive-action- observation history space H A i and macro-action-observation history space H M i of agent i; ฯ m: H A i A i denotes the low-level policy to achieve the macro-action m, and during the execution, each agent's primitive-observation o i โฆ i is generated according to probability observation function O i(o i,a i,s) Pr( o i a i,s), and a shared immediate reward r ( s,null a), where null a A iA i, is issued according to the reward function R: S A R .
Speeding Up Distributed Pseudo-tree Optimization Procedure with Cross Edge Consistency to Solve DCOPs
Rashik, Mashrur, Rahman, Md. Musfiqur, Mamun-or-Rashid, Md., Khan, Md. Mosaddek
Distributed Pseudo-tree Optimization Procedure (DPOP) is a well-known message passing algorithm that has been used to provide optimal solutions of Distributed Constraint Optimization Problems (DCOPs) -- a framework that is designed to optimize constraints in cooperative multi-agent systems. The traditional DCOP formulation does not consider those constraints that must be satisfied (also known as hard constraints), rather it concentrates only on soft constraints. However, the presence of both types of constraints are observed in a number of applications, such as Distributed Radio Link Frequency Assignment and Distributed Event Scheduling, etc. Although the combination of these types of constraints is recently incorporated in DPOP to solve DCOPs, scalability remains an issue for them as finding an optimal solution is NP-hard. Additionally, in DPOP, the agents are arranged as a DFS pseudo-tree. Recently it has been observed that the constructed pseudo-trees in this way often come to be chain-like and greatly impair the algorithm's performance. To address these issues, we develop an algorithm that speeds up the DPOP algorithm by reducing the size of the messages exchanged and increasing parallelism in the pseudo tree. Our empirical evidence suggests that our approach outperforms the state-of-the-art algorithms by a significant margin.
Capture the Flag: the emergence of complex cooperative agents
How did our agents perform as well as they did? First, we noticed that the agents had very fast reaction times and were very accurate taggers, which might explain their performance (tagging is a tactical action that sends opponents back to their starting point). Humans are comparatively slow to process and act on sensory input, due to our slower biological signalling. Here's an example of a reaction time test you can try yourself. Thus, our agents' superior performance might be a result of their faster visual processing and motor control.
On Memory Mechanism in Multi-Agent Reinforcement Learning
Zhou, Yilun, Asher, Derrik E., Waytowich, Nicholas R., Shah, Julie A.
Multi-agent reinforcement learning (MARL) extends (single-agent) reinforcement learning (RL) by introducing additional agents and (potentially) partial observability of the environment. Consequently, algorithms for solving MARL problems incorporate various extensions beyond traditional RL methods, such as a learned communication protocol between cooperative agents that enables exchange of private information or adaptive modeling of opponents in competitive settings. One popular algorithmic construct is a memory mechanism such that an agent's decisions can depend not only upon the current state but also upon the history of observed states and actions. In this paper, we study how a memory mechanism can be useful in environments with different properties, such as observability, internality and presence of a communication channel. Using both prior work and new experiments, we show that a memory mechanism is helpful when learning agents need to model other agents and/or when communication is constrained in some way; however we must to be cautious of agents achieving effective memoryfulness through other means.