Agent Societies
Queue Up Your Regrets: Achieving the Dynamic Capacity Region of Multiplayer Bandits
Consider $N$ cooperative agents such that for $T$ turns, each agent n takes an action $a_{n}$ and receives a stochastic reward $r_{n}\left(a_{1},\ldots,a_{N}\right)$. Agents cannot observe the actions of other agents and do not know even their own reward function. The agents can communicate with their neighbors on a connected graph $G$ with diameter $d\left(G\right)$. We want each agent $n$ to achieve an expected average reward of at least $\lambda_{n}$ over time, for a given quality of service (QoS) vector $\boldsymbol{\lambda}$. A QoS vector $\boldsymbol{\lambda}$ is not necessarily achievable.
On Sample Optimality in Personalized Collaborative and Federated Learning
In personalized federated learning, each member of a potentially large set of agents aims to train a model minimizing its loss function averaged over its local data distribution. We study this problem under the lens of stochastic optimization, focusing on a scenario with a large number of agents, that each possess very few data samples from their local data distribution. Specifically, we prove novel matching lower and upper bounds on the number of samples required from all agents to approximately minimize the generalization error of a fixed agent. We provide strategies matching these lower bounds, based on a gradient filtering approach: given prior knowledge on some notion of distance between local data distributions, agents filter and aggregate stochastic gradients received from other agents, in order to achieve an optimal bias-variance trade-off. Finally, we quantify the impact of using rough estimations of the distances between local distributions of agents, based on a very small number of local samples.
Convergence Guarantees for Federated SARSA with Local Training and Heterogeneous Agents
Mangold, Paul, Berthier, Eloรฏse, Moulines, Eric
We present a novel theoretical analysis of Federated SARSA (FedSARSA) with linear function approximation and local training. We establish convergence guarantees for FedSARSA in the presence of heterogeneity, both in local transitions and rewards, providing the first sample and communication complexity bounds in this setting. At the core of our analysis is a new, exact multi-step error expansion for single-agent SARSA, which is of independent interest. Our analysis precisely quantifies the impact of heterogeneity, demonstrating the convergence of FedSARSA with multiple local updates. Crucially, we show that FedSARSA achieves linear speed-up with respect to the number of agents, up to higher-order terms due to Markovian sampling. Numerical experiments support our theoretical findings.
Norm-Governed Multi-Agent Decision-Making in Simulator-Coupled Environments:The Reinsurance Constrained Multi-Agent Simulation Process (R-CMASP)
Reinsurance decision-making exhibits the core structural properties that motivate multi-agent models: distributed and asymmetric information, partial observability, heterogeneous epistemic responsibilities, simulator-driven environment dynamics, and binding prudential and regulatory constraints. Deterministic workflow automation cannot meet these requirements, as it lacks the epistemic flexibility, cooperative coordination mechanisms, and norm-sensitive behaviour required for institutional risk-transfer. We propose the Reinsurance Constrained Multi-Agent Simulation Process (R-CMASP), a formal model that extends stochastic games and Dec-POMDPs by adding three missing elements: (i) simulator-coupled transition dynamics grounded in catastrophe, capital, and portfolio engines; (ii) role-specialized agents with structured observability, belief updates, and typed communication; and (iii) a normative feasibility layer encoding solvency, regulatory, and organizational rules as admissibility constraints on joint actions. Using LLM-based agents with tool access and typed message protocols, we show in a domain-calibrated synthetic environment that governed multi-agent coordination yields more stable, coherent, and norm-adherent behaviour than deterministic automation or monolithic LLM baselines--reducing pricing variance, improving capital efficiency, and increasing clause-interpretation accuracy. Embedding prudential norms as admissibility constraints and structuring communication into typed acts measurably enhances equilibrium stability. Overall, the results suggest that regulated, simulator-driven decision environments are most naturally modelled as norm-governed, simulator-coupled multi-agent systems.
Distributionally Robust Markov Games with Average Reward
We study distributionally robust Markov games (DR-MGs) with the average-reward criterion, a framework for multi-agent decision-making under uncertainty over extended horizons. In average reward DR-MGs, agents aim to maximize their worst-case infinite-horizon average reward, to ensure satisfactory performance under environment uncertainties and opponent actions. We first establish a connection between the best-response policies and the optimal policies for the induced single-agent problems. Under a standard irreducible assumption, we derive a correspondence between the optimal policies and the solutions of the robust Bellman equation, and derive the existence of stationary Nash Equilibrium (NE) based on these results. We further study DR-MGs under the weakly communicating setting, where we construct a set-valued map and show its value is a subset of the best-response policies, convex and upper hemi-continuous, and derive the existence of NE. We then explore algorithmic solutions, by first proposing a Robust Nash-Iteration algorithm and providing convergence guarantees under some additional assumptions and a NE computing oracle. We further develop a temporal-difference based algorithm for DR-MGs, and provide convergence guarantees without any additional oracle or assumptions. Finally, we connect average-reward robust NE to discounted ones, showing that the average reward robust NE can be approximated by the discounted ones under a large discount factor. Our studies provide a comprehensive theoretical and algorithmic foundation for decision-making in complex, uncertain, and long-running multi-player environments.
Dynamic one-time delivery of critical data by small and sparse UAV swarms: a model problem for MARL scaling studies
Persson, Mika, Lidman, Jonas, Ljungberg, Jacob, Sandelius, Samuel, Andersson, Adam
This work presents a conceptual study on the application of Multi-Agent Reinforcement Learning (MARL) for decentralized control of unmanned aerial vehicles to relay a critical data package to a known position. For this purpose, a family of deterministic games is introduced, designed for scaling studies for MARL. A robust baseline policy is proposed, which is based on restricting agent motion envelopes and applying Dijkstra's algorithm. Experimental results show that two off-the-shelf MARL algorithms perform competitively with the baseline for a small number of agents, but scalability issues arise as the number of agents increase.
On Mobile Ad Hoc Networks for Coverage of Partially Observable Worlds
Meriaux, Edwin, Wen, Shuo, Langevin, Louis-Roy, Precup, Doina, Lorรญa, Antonio, Dudek, Gregory
This paper addresses the movement and placement of mobile agents to establish a communication network in initially unknown environments. We cast the problem in a computational-geometric framework by relating the coverage problem and line-of-sight constraints to the Cooperative Guard Art Gallery Problem, and introduce its partially observable variant, the Partially Observable Cooperative Guard Art Gallery Problem (POCGAGP). We then present two algorithms that solve POCGAGP: CADENCE, a centralized planner that incrementally selects 270 degree corners at which to deploy agents, and DADENCE, a decentralized scheme that coordinates agents using local information and lightweight messaging. Both approaches operate under partial observability and target simultaneous coverage and connectivity. We evaluate the methods in simulation across 1,500 test cases of varied size and structure, demonstrating consistent success in forming connected networks while covering and exploring unknown space. These results highlight the value of geometric abstractions for communication-driven exploration and show that decentralized policies are competitive with centralized performance while retaining scalability.
Distributed scalable coupled policy algorithm for networked multi-agent reinforcement learning
Dai, Pengcheng, Wang, Dongming, Yu, Wenwu, Ren, Wei
This paper studies networked multi-agent reinforcement learning (NMARL) with interdependent rewards and coupled policies. In this setting, each agent's reward depends on its own state-action pair as well as those of its direct neighbors, and each agent's policy is parameterized by its local parameters together with those of its $ฮบ_{p}$-hop neighbors, with $ฮบ_{p}\geq 1$ denoting the coupled radius. The objective of the agents is to collaboratively optimize their policies to maximize the discounted average cumulative reward. To address the challenge of interdependent policies in collaborative optimization, we introduce a novel concept termed the neighbors' averaged $Q$-function and derive a new expression for the coupled policy gradient. Based on these theoretical foundations, we develop a distributed scalable coupled policy (DSCP) algorithm, where each agent relies only on the state-action pairs of its $ฮบ_{p}$-hop neighbors and the rewards of its $(ฮบ_{p}+1)$-hop neighbors. Specially, in the DSCP algorithm, we employ a geometric 2-horizon sampling method that does not require storing a full $Q$-table to obtain an unbiased estimate of the coupled policy gradient. Moreover, each agent interacts exclusively with its direct neighbors to obtain accurate policy parameters, while maintaining local estimates of other agents' parameters to execute its local policy and collect samples for optimization. These estimates and policy parameters are updated via a push-sum protocol, enabling distributed coordination of policy updates across the network. We prove that the joint policy produced by the proposed algorithm converges to a first-order stationary point of the objective function. Finally, the effectiveness of DSCP algorithm is demonstrated through simulations in a robot path planning environment, showing clear improvement over state-of-the-art methods.
Robustness and Adaptability of Reinforcement Learning based Cooperative Autonomous Driving in Mixed-autonomy Traffic
Valiente, Rodolfo, Toghi, Behrad, Pedarsani, Ramtin, Fallah, Yaser P.
HE development of autonomous vehicles (A Vs) is on the verge of passing beyond the laboratory and simulation tests and is shifting towards addressing the challenges that limit their practicality in today's society. While there is still need for further technological improvements to enable safe and smooth operation of a single A V, a great deal of research attention is being focused on the emerging challenge of operating multiple A Vs and the co-existence of A Vs and human-driven vehicles (HVs) [1], [2]. A realistic outlook for the adoption of autonomous vehicles on the roads is a mixed-traffic scenario in which human drivers with different driving styles and social preferences share the road with A Vs that are perhaps built by different manufacturers and hence follow different policies [3], [4]. In this work, we seek a solution that can ensure the safety and robustness of A Vs in the presence of human drivers with heterogeneous behavioral traits. Connected & autonomous vehicles (CA Vs) via vehicle-to-vehicle (V2V) communication allow vehicles to directly communicate with their neighbors, creating an extended perception that enables explicit coordination among vehicles to overcome the limitations of an isolated agent [5]-[11]. While planning in a fully A V scenario is relatively easy to achieve, coordination in the presence of HVs is a significantly more challenging task, as the A Vs not only need to react to road objects but also need to consider the behaviors of HVs [3], [4], [12]. We start by identifying the major challenges in the domain of behavior planning and prediction for A Vs in mixed-autonomy traffic.
LLM Collaboration With Multi-Agent Reinforcement Learning
Liu, Shuo, Chen, Tianle, Liang, Zeyu, Lyu, Xueguang, Amato, Christopher
A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. For example, existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address this challenge, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning multiple LLMs with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using MARL methods for LLM collaboration and highlights the associated challenges.