Agent Societies
Leveraging Statistical Multi-Agent Online Planning with Emergent Value Function Approximation
Phan, Thomy, Belzner, Lenz, Gabor, Thomas, Schmid, Kyrill
Making decisions is a great challenge in distributed autonomous environments due to enormous state spaces and uncertainty. Many online planning algorithms rely on statistical sampling to avoid searching the whole state space, while still being able to make acceptable decisions. However, planning often has to be performed under strict computational constraints making online planning in multi-agent systems highly limited, which could lead to poor system performance, especially in stochastic domains. In this paper, we propose Emergent Value function Approximation for Distributed Environments (EVADE), an approach to integrate global experience into multi-agent online planning in stochastic domains to consider global effects during local planning. For this purpose, a value function is approximated online based on the emergent system behaviour by using methods of reinforcement learning. We empirically evaluated EVADE with two statistical multi-agent online planning algorithms in a highly complex and stochastic smart factory environment, where multiple agents need to process various items at a shared set of machines. Our experiments show that EVADE can effectively improve the performance of multi-agent online planning while offering efficiency w.r.t. the breadth and depth of the planning process.
Decentralized Monte Carlo Tree Search for Partially Observable Multi-agent Pathfinding
Skrynnik, Alexey, Andreychuk, Anton, Yakovlev, Konstantin, Panov, Aleksandr
The Multi-Agent Pathfinding (MAPF) problem involves finding a set of conflict-free paths for a group of agents confined to a graph. In typical MAPF scenarios, the graph and the agents' starting and ending vertices are known beforehand, allowing the use of centralized planning algorithms. However, in this study, we focus on the decentralized MAPF setting, where the agents may observe the other agents only locally and are restricted in communications with each other. Specifically, we investigate the lifelong variant of MAPF, where new goals are continually assigned to the agents upon completion of previous ones. Drawing inspiration from the successful AlphaZero approach, we propose a decentralized multi-agent Monte Carlo Tree Search (MCTS) method for MAPF tasks. Our approach utilizes the agent's observations to recreate the intrinsic Markov decision process, which is then used for planning with a tailored for multi-agent tasks version of neural MCTS. The experimental results show that our approach outperforms state-of-the-art learnable MAPF solvers. The source code is available at https://github.com/AIRI-Institute/mats-lp.
Leading the Pack: N-player Opponent Shaping
Souly, Alexandra, Willi, Timon, Khan, Akbir, Kirk, Robert, Lu, Chris, Grefenstette, Edward, Rocktäschel, Tim
Reinforcement learning solutions have great success in the 2-player general sum setting. In this setting, the paradigm of Opponent Shaping (OS), in which agents account for the learning of their co-players, has led to agents which are able to avoid collectively bad outcomes, whilst also maximizing their reward. These methods have currently been limited to 2-player game. However, the real world involves interactions with many more agents, with interactions on both local and global scales. In this paper, we extend Opponent Shaping (OS) methods to environments involving multiple co-players and multiple shaping agents. We evaluate on over 4 different environments, varying the number of players from 3 to 5, and demonstrate that model-based OS methods converge to equilibrium with better global welfare than naive learning. However, we find that when playing with a large number of co-players, OS methods' relative performance reduces, suggesting that in the limit OS methods may not perform well. Finally, we explore scenarios where more than one OS method is present, noticing that within games requiring a majority of cooperating agents, OS methods converge to outcomes with poor global welfare.
Multi-Task Multi-Agent Shared Layers are Universal Cognition of Multi-Agent Coordination
Wang, Jiawei, Zhao, Jian, Cao, Zhengtao, Feng, Ruili, Qin, Rongjun, Yu, Yang
Multi-agent reinforcement learning shines as the pinnacle of multi-agent systems, conquering intricate real-world challenges, fostering collaboration and coordination among agents, and unleashing the potential for intelligent decision-making across domains. However, training a multi-agent reinforcement learning network is a formidable endeavor, demanding substantial computational resources to interact with diverse environmental variables, extract state representations, and acquire decision-making knowledge. The recent breakthroughs in large-scale pre-trained models ignite our curiosity: Can we uncover shared knowledge in multi-agent reinforcement learning and leverage pre-trained models to expedite training for future tasks? Addressing this issue, we present an innovative multi-task learning approach that aims to extract and harness common decision-making knowledge, like cooperation and competition, across different tasks. Our approach involves concurrent training of multiple multi-agent tasks, with each task employing independent front-end perception layers while sharing back-end decision-making layers. This effective decoupling of state representation extraction from decision-making allows for more efficient training and better transferability. To evaluate the efficacy of our proposed approach, we conduct comprehensive experiments in two distinct environments: the StarCraft Multi-agent Challenge (SMAC) and the Google Research Football (GRF) environments. The experimental results unequivocally demonstrate the smooth transferability of the shared decision-making network to other tasks, thereby significantly reducing training costs and improving final performance. Furthermore, visualizations authenticate the presence of general multi-agent decision-making knowledge within the shared network layers, further validating the effectiveness of our approach.
CARSS: Cooperative Attention-guided Reinforcement Subpath Synthesis for Solving Traveling Salesman Problem
Shi, Yuchen, Han, Congying, Guo, Tiande
This paper introduces CARSS (Cooperative Attention-guided Reinforcement Subpath Synthesis), a novel approach to address the Traveling Salesman Problem (TSP) by leveraging cooperative Multi-Agent Reinforcement Learning (MARL). CARSS decomposes the TSP solving process into two distinct yet synergistic steps: "subpath generation" and "subpath merging." In the former, a cooperative MARL framework is employed to iteratively generate subpaths using multiple agents. In the latter, these subpaths are progressively merged to form a complete cycle. The algorithm's primary objective is to enhance efficiency in terms of training memory consumption, testing time, and scalability, through the adoption of a multi-agent divide and conquer paradigm. Notably, attention mechanisms play a pivotal role in feature embedding and parameterization strategies within CARSS. The training of the model is facilitated by the independent REINFORCE algorithm. Empirical experiments reveal CARSS's superiority compared to single-agent alternatives: it demonstrates reduced GPU memory utilization, accommodates training graphs nearly 2.5 times larger, and exhibits the potential for scaling to even more extensive problem sizes. Furthermore, CARSS substantially reduces testing time and optimization gaps by approximately 50% for TSP instances of up to 1000 vertices, when compared to standard decoding methods.
Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning
Christianos, Filippos, Papoudakis, Georgios, Zimmer, Matthieu, Coste, Thomas, Wu, Zhihao, Chen, Jingxuan, Khandelwal, Khyati, Doran, James, Feng, Xidong, Liu, Jiacheng, Xiong, Zheng, Luo, Yicheng, Hao, Jianye, Shao, Kun, Bou-Ammar, Haitham, Wang, Jun
A key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL). However, constructing a standalone RL policy that maps perception to action directly encounters severe problems, chief among them being its lack of generality across multiple tasks and the need for a large amount of training data. The leading cause is that it cannot effectively integrate prior information into the perception-action cycle when devising the policy. Large language models (LLMs) emerged as a fundamental way to incorporate cross-domain knowledge into AI agents but lack crucial learning and adaptation toward specific decision problems. This paper presents a general framework model for integrating and learning structured reasoning into AI agents' policies. Our methodology is motivated by the modularity found in the human brain. The framework utilises the construction of intrinsic and extrinsic functions to add previous understandings of reasoning structures. It also provides the adaptive ability to learn models inside every module or function, consistent with the modular structure of cognitive processes. We describe the framework in-depth and compare it with other AI pipelines and existing frameworks. The paper explores practical applications, covering experiments that show the effectiveness of our method. Our results indicate that AI agents perform and adapt far better when organised reasoning and prior knowledge are embedded. This opens the door to more resilient and general AI agent systems.
A Polarization Opinion Model Inspired by Bounded Confidence Communications
Cyranka, Jacek, Mucha, Piotr B.
We present an opinion model founded upon the principles of the bounded confidence interaction among agents. Our objective is to explain the polarization effects inherent to vector-valued opinions. The evolutionary process adheres to the rule where each agent aspires to increase polarization through communication with a single friend during each discrete time step. The dynamics ensure that agents' ultimate (temporal) configuration will encompass a finite number of outlier states. We introduce deterministic and stochastic models, accompanied by a comprehensive mathematical analysis of their inherent properties. Additionally, we provide compelling illustrative examples and introduce a stochastic solver tailored for scenarios featuring an extensive set of agents. Furthermore, in the context of smaller agent populations, we scrutinize the suitability of neural networks for the rapid inference of limit configurations.
Multi-Agent Probabilistic Ensembles with Trajectory Sampling for Connected Autonomous Vehicles
Wen, Ruoqi, Huang, Jiahao, Li, Rongpeng, Ding, Guoru, Zhao, Zhifeng
Autonomous Vehicles (AVs) have attracted significant attention in recent years and Reinforcement Learning (RL) has shown remarkable performance in improving the autonomy of vehicles. In that regard, the widely adopted Model-Free RL (MFRL) promises to solve decision-making tasks in connected AVs (CAVs), contingent on the readiness of a significant amount of data samples for training. Nevertheless, it might be infeasible in practice and possibly lead to learning instability. In contrast, Model-Based RL (MBRL) manifests itself in sample-efficient learning, but the asymptotic performance of MBRL might lag behind the state-of-the-art MFRL algorithms. Furthermore, most studies for CAVs are limited to the decision-making of a single AV only, thus underscoring the performance due to the absence of communications. In this study, we try to address the decision-making problem of multiple CAVs with limited communications and propose a decentralized Multi-Agent Probabilistic Ensembles with Trajectory Sampling algorithm MA-PETS. In particular, in order to better capture the uncertainty of the unknown environment, MA-PETS leverages Probabilistic Ensemble (PE) neural networks to learn from communicated samples among neighboring CAVs. Afterwards, MA-PETS capably develops Trajectory Sampling (TS)-based model-predictive control for decision-making. On this basis, we derive the multi-agent group regret bound affected by the number of agents within the communication range and mathematically validate that incorporating effective information exchange among agents into the multi-agent learning scheme contributes to reducing the group regret bound in the worst case. Finally, we empirically demonstrate the superiority of MA-PETS in terms of the sample efficiency comparable to MFBL.
Rate-Tunable Control Barrier Functions: Methods and Algorithms for Online Adaptation
Parwana, Hardik, Panagou, Dimitra
Control Barrier Functions offer safety certificates by dictating controllers that enforce safety constraints. However, their response depends on the classK function that is used to restrict the rate of change of the value of the barrier function along the system trajectories. This paper introduces the notion of a Rate-Tunable (RT) CBF, which allows for online tuning of the response of CBF-based controllers. In contrast to existing approaches that use a fixed classK function to ensure safety, we parameterize and adapt the classK function parameters online. We show that this helps improve the system's response in terms of the resulting trajectories being closer to a nominal reference while being sufficiently far from the boundary of the safe set. We provide point-wise sufficient conditions to be imposed on any user-given parameter dynamics so that multiple CBF constraints continue to admit a common control input with time. Finally, we introduce RT-CBF parameter dynamics for decentralized noncooperative multi-agent systems, where a trust factor, computed based on the instantaneous ease of constraint satisfaction, is used to update parameters online for a less conservative response.
Action Duration Generalization for Exact Multi-Agent Collective Construction
This paper addresses exact approaches to multi-agent collective construction problem which tasks a group of cooperative agents to build a given structure in a blocksworld under the gravity constraint. We propose a generalization of the existing exact model based on mixed integer linear programming by accommodating varying agent action durations. We refer to the model as a fraction-time model. The generalization by introducing action duration enables one to create a more realistic model for various domains. It provides a significant reduction of plan execution duration at the cost of increased computational time, which rises steeply the closer the model gets to the exact real-world action duration. We also propose a makespan estimation function for the fraction-time model. This can be used to estimate the construction time reduction size for the purpose of cost-benefit analysis. The fraction-time model and the makespan estimation function have been evaluated in a series of experiments using a set of benchmark structures. The results show a significant reduction of plan execution duration for non-constant duration actions due to decreasing synchronization overhead at the end of each action. According to the results, the makespan estimation function provides a reasonably accurate estimate of the makespan.