Agent Societies
Heterogeneous-Agent Mirror Learning: A Continuum of Solutions to Cooperative MARL
Kuba, Jakub Grudzien, Feng, Xidong, Ding, Shiyao, Dong, Hao, Wang, Jun, Yang, Yaodong
The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in the artificial intelligence (AI) research community. However, many research endeavours have been focused on developing practical MARL algorithms whose effectiveness has been studied only empirically, thereby lacking theoretical guarantees. As recent studies have revealed, MARL methods often achieve performance that is unstable in terms of reward monotonicity or suboptimal at convergence. To resolve these issues, in this paper, we introduce a novel framework named Heterogeneous-Agent Mirror Learning (HAML) that provides a general template for MARL algorithmic designs. We prove that algorithms derived from the HAML template satisfy the desired properties of the monotonic improvement of the joint reward and the convergence to Nash equilibrium. We verify the practicality of HAML by proving that the current state-of-the-art cooperative MARL algorithms, HATRPO and HAPPO, are in fact HAML instances. Next, as a natural outcome of our theory, we propose HAML extensions of two well-known RL algorithms, HAA2C (for A2C) and HADDPG (for DDPG), and demonstrate their effectiveness against strong baselines on StarCraftII and Multi-Agent MuJoCo tasks.
An Introduction to Multi-Agent Reinforcement Learning and Review of its Application to Autonomous Mobility
Schmidt, Lukas M., Brosig, Johanna, Plinge, Axel, Eskofier, Bjoern M., Mutschler, Christopher
Many scenarios in mobility and traffic involve multiple different agents that need to cooperate to find a joint solution. Recent advances in behavioral planning use Reinforcement Learning to find effective and performant behavior strategies. However, as autonomous vehicles and vehicle-to-X communications become more mature, solutions that only utilize single, independent agents leave potential performance gains on the road. Multi-Agent Reinforcement Learning (MARL) is a research field that aims to find optimal solutions for multiple agents that interact with each other. This work aims to give an overview of the field to researchers in autonomous mobility. We first explain MARL and introduce important concepts. Then, we discuss the central paradigms that underlie MARL algorithms, and give an overview of state-of-the-art methods and ideas in each paradigm. With this background, we survey applications of MARL in autonomous mobility scenarios and give an overview of existing scenarios and implementations.
Constrained multi-agent ergodic area surveying control based on finite element approximation of the potential field
Ivić, Stefan, Sikirica, Ante, Crnković, Bojan
Heat Equation Driven Area Coverage (HEDAC) is a state-of-the-art multi-agent ergodic motion control guided by a gradient of a potential field. A finite element method is hereby implemented to obtain a solution of the Helmholtz partial differential equation, which models the potential field for surveying motion control. This allows us to survey arbitrarily shaped domains and to include obstacles in an elegant and robust manner intrinsic to HEDAC's fundamental idea. For a simple kinematic motion, the obstacles and boundary avoidance constraints are successfully handled by directing the agent motion with the gradient of the potential. However, including additional constraints, such as the minimal clearance distance from stationary and moving obstacles and the minimal path curvature radius, requires further alternations of the control algorithm. We introduce a relatively simple yet robust approach for handling these constraints by formulating a straightforward optimization problem based on collision-free escape route maneuvers. This approach provides a guaranteed collision avoidance mechanism while being computationally inexpensive as a result of the optimization problem partitioning. The proposed motion control is evaluated in three realistic surveying scenarios simulations, showing the effectiveness of the surveying and the robustness of the control algorithm. Furthermore, potential maneuvering difficulties due to improperly defined surveying scenarios are highlighted and we provide guidelines on how to overpass them. The results are promising and indicate real-world applicability of the proposed constrained multi-agent motion control for autonomous surveying and potentially other HEDAC utilizations.
Unified Automatic Control of Vehicular Systems with Reinforcement Learning
Yan, Zhongxia, Kreidieh, Abdul Rahman, Vinitsky, Eugene, Bayen, Alexandre M., Wu, Cathy
Emerging vehicular systems with increasing proportions of automated components present opportunities for optimal control to mitigate congestion and increase efficiency. There has been a recent interest in applying deep reinforcement learning (DRL) to these nonlinear dynamical systems for the automatic design of effective control strategies. Despite conceptual advantages of DRL being model-free, studies typically nonetheless rely on training setups that are painstakingly specialized to specific vehicular systems. This is a key challenge to efficient analysis of diverse vehicular and mobility systems. To this end, this article contributes a streamlined methodology for vehicular microsimulation and discovers high performance control strategies with minimal manual design. A variable-agent, multi-task approach is presented for optimization of vehicular Partially Observed Markov Decision Processes. The methodology is experimentally validated on mixed autonomy traffic systems, where fractions of vehicles are automated; empirical improvement, typically 15-60% over a human driving baseline, is observed in all configurations of six diverse open or closed traffic systems. The study reveals numerous emergent behaviors resembling wave mitigation, traffic signaling, and ramp metering. Finally, the emergent behaviors are analyzed to produce interpretable control strategies, which are validated against the learned control strategies.
Cooperative Actor-Critic via TD Error Aggregation
Figura, Martin, Lin, Yixuan, Liu, Ji, Gupta, Vijay
In decentralized cooperative multi-agent reinforcement learning, agents can aggregate information from one another to learn policies that maximize a team-average objective function. Despite the willingness to cooperate with others, the individual agents may find direct sharing of information about their local state, reward, and value function undesirable due to privacy issues. In this work, we introduce a decentralized actor-critic algorithm with TD error aggregation that does not violate privacy issues and assumes that communication channels are subject to time delays and packet dropouts. The cost we pay for making such weak assumptions is an increased communication burden for every agent as measured by the dimension of the transmitted data. Interestingly, the communication burden is only quadratic in the graph size, which renders the algorithm applicable in large networks. We provide a convergence analysis under diminishing step size to verify that the agents maximize the team-average objective function.
Distributed Projection-free Algorithm for Constrained Aggregative Optimization
In this paper, we focus on solving a distributed convex aggregative optimization problem in a network, where each agent has its own cost function which depends not only on its own decision variables but also on the aggregated function of all agents' decision variables. The decision variable is constrained within a feasible set. In order to minimize the sum of the cost functions when each agent only knows its local cost function, we propose a distributed Frank-Wolfe algorithm based on gradient tracking for the aggregative optimization problem where each node maintains two estimates, namely an estimate of the sum of agents' decision variable and an estimate of the gradient of global function. The algorithm is projection-free, but only involves solving a linear optimization to get a search direction at each step. We show the convergence of the proposed algorithm for convex and smooth objective functions over a time-varying network. Finally, we demonstrate the convergence and computational efficiency of the proposed algorithm via numerical simulations.
JAM: The JavaScript Agent Machine for Distributed Computing and Simulation with reactive and mobile Multi-agent Systems -- A Technical Report
Agent-based modelling (ABM), simulation (ABS), and distributed computation (ABC) are established methods. The Internet and Web-based technologies are suitable carriers. This paper is a technical report with some tutorial aspects of the JavaScript Agent Machine (JAM) platform and the programming of agents with AgentJS, a sub-set of the widely used JavaScript programming language for the programming of mobile state-based reactive agents. In addition to explaining the motivation for particular design choices and introducing core concepts of the architecture and the programming of agents in JavaScript, short examples illustrate the power of the JAM platform and its components for the deployment of large-scale multi-agent system in strong heterogeneous environments like the Internet. JAM is suitable for the deployment in strong heterogeneous and mobile environments. Finally, JAM can be used for ABC as well as for ABS in an unified methodology, finally enabling mobile crowd sensing coupled with simulation (ABS).
Fast-Replanning Motion Control for Non-Holonomic Vehicles with Aborting A*
Missura, Marcell, Roychoudhury, Arindam, Bennewitz, Maren
Autonomously driving vehicles must be able to navigate in dynamic and unpredictable environments in a collision-free manner. So far, this has only been partially achieved in driverless cars and warehouse installations where marked structures such as roads, lanes, and traffic signs simplify the motion planning and collision avoidance problem. We are presenting a new control approach for car-like vehicles that is based on an unprecedentedly fast-paced A* implementation that allows the control cycle to run at a frequency of 30 Hz. This frequency enables us to place our A* algorithm as a low-level replanning controller that is well suited for navigation and collision avoidance in virtually any dynamic environment. Due to an efficient heuristic consisting of rotate-translate-rotate motions laid out along the shortest path to the target, our Short-Term Aborting A* (STAA*) converges fast and can be aborted early in order to guarantee a high and steady control rate. While our STAA* expands states along the shortest path, it takes care of collision checking with the environment including predicted states of moving obstacles, and returns the best solution found when the computation time runs out. Despite the bounded computation time, our STAA* does not get trapped in corners due to the following of the shortest path. In simulated and real-robot experiments, we demonstrate that our control approach eliminates collisions almost entirely and is superior to an improved version of the Dynamic Window Approach with predictive collision avoidance capabilities.
Stochastic Market Games
Schmid, Kyrill, Belzner, Lenz, Müller, Robert, Tochtermann, Johannes, Linnhoff-Popien, Claudia
Some of the most relevant future applications of multi-agent systems like autonomous driving or factories as a service display mixed-motive scenarios, where agents might have conflicting goals. In these settings agents are likely to learn undesirable outcomes in terms of cooperation under independent learning, such as overly greedy behavior. Motivated from real world societies, in this work we propose to utilize market forces to provide incentives for agents to become cooperative. As demonstrated in an iterated version of the Prisoner's Dilemma, the proposed market formulation can change the dynamics of the game to consistently learn cooperative policies. Further we evaluate our approach in spatially and temporally extended settings for varying numbers of agents. We empirically find that the presence of markets can improve both the overall result and agent individual returns via their trading activities.
Why do Policy Gradient Methods work so well in Cooperative MARL? Evidence from Policy Representation
In cooperative multi-agent reinforcement learning (MARL), due to its on-policy nature, policy gradient (PG) methods are typically believed to be less sample efficient than value decomposition (VD) methods, which are off-policy. However, some recent empirical studies demonstrate that with proper input representation and hyper-parameter tuning, multi-agent PG can achieve surprisingly strong performance compared to off-policy VD methods. Why could PG methods work so well? In this post, we will present concrete analysis to show that in certain scenarios, e.g., environments with a highly multi-modal reward landscape, VD can be problematic and lead to undesired outcomes. In addition, PG methods with auto-regressive (AR) policies can learn multi-modal policies.