Agent Societies
A Multiagent Reinforcement Learning Algorithm with Non-linear Dynamics
Several multiagent reinforcement learning (MARL) algorithms have been proposed to optimize agents' decisions. Due to the complexity of the problem, the majority of the previously developed MARL algorithms assumed agents either had some knowledge of the underlying game (such as Nash equilibria) and/or observed other agents actions and the rewards they received. We introduce a new MARL algorithm called the Weighted Policy Learner (WPL), which allows agents to reach a Nash Equilibrium (NE) in benchmark 2-player-2-action games with minimum knowledge. Using WPL, the only feedback an agent needs is its own local reward (the agent does not observe other agents actions or rewards). Furthermore, WPL does not assume that agents know the underlying game or the corresponding Nash Equilibrium a priori. We experimentally show that our algorithm converges in benchmark two-player-two-action games. We also show that our algorithm converges in the challenging Shapley's game where previous MARL algorithms failed to converge without knowing the underlying game or the NE. Furthermore, we show that WPL outperforms the state-of-the-art algorithms in a more realistic setting of 100 agents interacting and learning concurrently. An important aspect of understanding the behavior of a MARL algorithm is analyzing the dynamics of the algorithm: how the policies of multiple learning agents evolve over time as agents interact with one another. Such an analysis not only verifies whether agents using a given MARL algorithm will eventually converge, but also reveals the behavior of the MARL algorithm prior to convergence. We analyze our algorithm in two-player-two-action games and show that symbolically proving WPL's convergence is difficult, because of the non-linear nature of WPL's dynamics, unlike previous MARL algorithms that had either linear or piece-wise-linear dynamics. Instead, we numerically solve WPL's dynamics differential equations and compare the solution to the dynamics of previous MARL algorithms.
Completeness and Performance Of The APO Algorithm
Asynchronous Partial Overlay (APO) is a search algorithm that uses cooperative mediation to solve Distributed Constraint Satisfaction Problems (DisCSPs). The algorithm partitions the search into different subproblems of the DisCSP. The original proof of completeness of the APO algorithm is based on the growth of the size of the subproblems. The present paper demonstrates that this expected growth of subproblems does not occur in some situations, leading to a termination problem of the algorithm. The problematic parts in the APO algorithm that interfere with its completeness are identified and necessary modifications to the algorithm that fix these problematic parts are given. The resulting version of the algorithm, Complete Asynchronous Partial Overlay (CompAPO), ensures its completeness. Formal proofs for the soundness and completeness of CompAPO are given. A detailed performance evaluation of CompAPO comparing it to other DisCSP algorithms is presented, along with an extensive experimental evaluation of the algorithms unique behavior. Additionally, an optimization version of the algorithm, CompOptAPO, is presented, discussed, and evaluated.
Solving Multiagent Networks Using Distributed Constraint Optimization
Pearce, Jonathan P. (JP Morgan Chase) | Tambe, Milind (University of Southern California) | Maheswaran, Rajiv (University of Southern California)
In many cooperative multiagent domains, the effect of local interactions between agents can be compactly represented as a network structure. Given that agents are spread across such a network, agents directly interact only with a small group of neighbors. A distributed constraint optimization problem (DCOP) is a useful framework to reason about such networks of agents. Given agentsโ inability to communicate and collaborate in large groups in such networks, we focus on an approach called k-optimality for solving DCOPs. In this approach, agents form groups of one or more agents until no group of k or fewer agents can possibly improve the DCOP solution; we define this type of local optimum, and any algorithm guaranteed to reach such a local optimum, as k-optimal. The article provides an overview of three key results related to koptimality. The first set of results gives worst-case guarantees on the solution quality of k-optima in a DCOP. These guarantees can help determine an appropriate k-optimal algorithm, or possibly an appropriate constraint graph structure, for agents to use in situations where the cost of coordination between agents must be weighed against the quality of the solution reached. The second set of results gives upper bounds on the number of k-optima that can exist in a DCOP. These results are useful in domains where a DCOP must generate a set of solutions rather than a single solution. Finally, we sketch algorithms for k-optimality and provide some experimental results for 1-, 2- and 3-optimal algorithms for several types of DCOPs.
Introduction to the Special Issue on AI and Networks
Jardins, Marie des (University of Maryland) | Gaston, Matthew E. (Viz) | Radev, Dragomir R. (University of Michigan)
This introduction to AI Magazine's Special Issueon Networks and AI summarizes the seven articles in thespecial issue by characterizing the nature of thenetworks that are the focus of each of the papers.A short tutorial on graph theory and network structuresis included for those less familiar with the topic.
M-DPOP: Faithful Distributed Implementation of Efficient Social Choice Problems
Petcu, A., Faltings, B., Parkes, D. C.
In the efficient social choice problem, the goal is to assign values, subject to side constraints, to a set of variables to maximize the total utility across a population of agents, where each agent has private information about its utility function. In this paper we model the social choice problem as a distributed constraint optimization problem (DCOP), in which each agent can communicate with other agents that share an interest in one or more variables. Whereas existing DCOP algorithms can be easily manipulated by an agent, either by misreporting private information or deviating from the algorithm, we introduce M-DPOP, the first DCOP algorithm that provides a faithful distributed implementation for efficient social choice. This provides a concrete example of how the methods of mechanism design can be unified with those of distributed optimization. Faithfulness ensures that no agent can benefit by unilaterally deviating from any aspect of the protocol, neither information-revelation, computation, nor communication, and whatever the private information of other agents. We allow for payments by agents to a central bank, which is the only central authoritythat we require. To achieve faithfulness, we carefully integrate the Vickrey-Clarke-Groves (VCG) mechanism with the DPOP algorithm, such that each agent is only asked to perform computation, report information, and send messages that is in its own best interest. Determining agent i's payment requires solving the social choice problem without agent i. Here, we present a method to reuse computation performed in solving the main problem in a way that is robust against manipulation by the excluded agent. Experimental results on structured problems show that as much as 87% of the computation required for solving the marginal problems can be avoided by re-use, providing very good scalability in the number of agents. On unstructured problems, we observe a sensitivity of M-DPOP to the density of the problem, and we show that reusability decreases from almost 100% for very sparse problems to around 20% for highly connected problems. We close with a discussion of the features of DCOP that enable faithful implementations in this problem, the challenge of reusing computation from the main problem to marginal problems in other algorithms such as ADOPT and OptAPO, and the prospect of methods to avoid the welfare loss that can occur because of the transfer of payments to the bank.
Electric Elves: What Went Wrong and Why
Tambe, Milind (University of Southern California)
Software personal assistants continue to be a topic of signi๏ฌcant research interest. This article outlines some of the important lessons learned from a successfully-deployed team of personal assistant agents (Electric Elves) in an of๏ฌce environment. In the Electric Elves project, a team of almost a dozen personal assistant agents were continually active for seven months. Each elf (agent) represented one person and assisted in daily activities in an actual of๏ฌce environment. This project led to several important observations about privacy, adjustable autonomy, and social norms in of๏ฌce environments. In addition to outlining some of the key lessons learned we outline our continued research to address some of the concerns raised.
Graphical Model Inference in Optimal Control of Stochastic Multi-Agent Systems
van den Broek, B., Wiegerinck, W., Kappen, B.
In this article we consider the issue of optimal control in collaborative multi-agent systems with stochastic dynamics. The agents have a joint task in which they have to reach a number of target states. The dynamics of the agents contains additive control and additive noise, and the autonomous part factorizes over the agents. Full observation of the global state is assumed. The goal is to minimize the accumulated joint cost, which consists of integrated instantaneous costs and a joint end cost. The joint end cost expresses the joint task of the agents. The instantaneous costs are quadratic in the control and factorize over the agents. The optimal control is given as a weighted linear combination of single-agent to single-target controls. The single-agent to single-target controls are expressed in terms of diffusion processes. These controls, when not closed form expressions, are formulated in terms of path integrals, which are calculated approximately by Metropolis-Hastings sampling. The weights in the control are interpreted as marginals of a joint distribution over agent to target assignments. The structure of the latter is represented by a graphical model, and the marginals are obtained by graphical model inference. Exact inference of the graphical model will break down in large systems, and so approximate inference methods are needed. We use naive mean field approximation and belief propagation to approximate the optimal control in systems with linear dynamics. We compare the approximate inference methods with the exact solution, and we show that they can accurately compute the optimal control. Finally, we demonstrate the control method in multi-agent systems with nonlinear dynamics consisting of up to 80 agents that have to reach an equal number of target states.
On Local Rewards and Scaling Distributed Reinforcement Learning
We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n.
On Local Rewards and Scaling Distributed Reinforcement Learning
We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n.
On Local Rewards and Scaling Distributed Reinforcement Learning
We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lowerbound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require anumber of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance witha number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n.