Goto

Collaborating Authors

 joint policy


AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation

Neural Information Processing Systems

One of the main challenges in offline Reinforcement Learning (RL) is the distribution shift that arises from the learned policy deviating from the data collection policy. This is often addressed by avoiding out-of-distribution (OOD) actions during policy improvement as their presence can lead to substantial performance degradation. This challenge is amplified in the offline Multi-Agent RL (MARL) setting since the joint action space grows exponentially with the number of agents. To avoid this curse of dimensionality, existing MARL methods adopt either value decomposition methods or fully decentralized training of individual agents. However, even when combined with standard conservatism principles, these methods can still result in the selection of OOD joint actions in offline MARL. To this end, we introduce AlberDICE, an offline MARL algorithm that alternatively performs centralized training of individual agents based on stationary distribution optimization. AlberDICE circumvents the exponential complexity of MARL by computing the best response of one agent at a time while effectively avoiding OOD joint action selection. Theoretically, we show that the alternating optimization procedure converges to Nash policies. In the experiments, we demonstrate that AlberDICE significantly outperforms baseline algorithms on a standard suite of MARL benchmarks.


CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

arXiv.org Machine Learning

Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they cannot co-adapt as their policies change. We introduce CODA (Coordination via On-Policy Diffusion for Multi-Agent Reinforcement Learning), a diffusion-based multi-agent trajectory generator for data augmentation that samples conditioned on the current joint policy, producing synthetic experience which reflects the evolving behaviours of the agents, thereby providing a mechanism for co-adaptation. We find that previous diffusion-based augmentation approaches are insufficient for fostering multi-agent coordination because they produce static augmented datasets that do not evolve as the current joint policy changes during training; CODA resolves this by more closely simulating on-policy learning and is a meaningful step toward coordinated behaviours in the offline setting. CODA is algorithm-agnostic and can be layered onto both model-free and model-based offline reinforcement learning pipelines as an augmentation module. Empirically, CODA not only resolves canonical coordination pathologies in continuous polynomial games but also delivers strong results on the more complex MaMuJoCo continuous-control benchmarks.


Details

Neural Information Processing Systems

A.1 Difference between the performance of two joint policies In Section 3.1, the difference between the performance of two joint policies is expressed as follows: The proof is a multi-agent version of the proof in (Kakade and Langford, 2002). Now we provide the mathematical detail formally. A.2 Approximation that matches the true value to first order In Section 3.1, we claim that Jฯ€( ฯ€) matches J( ฯ€) to first order. Intuitively, this means that a sufficiently small update of the joint policy which improves Jฯ€( ฯ€) will also improve J( ฯ€). Now we prove it formally.


Towards Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games

Neural Information Processing Systems

Measuring and promoting policy diversity is critical for solving games with strong non-transitive dynamics where strategic cycles exist, and there is no consistent winner (e.g., Rock-Paper-Scissors). With that in mind, maintaining a pool of diverse policies via open-ended learning is an attractive solution, which can generate auto-curricula to avoid being exploited. However, in conventional open-ended learning algorithms, there are no widely accepted definitions for diversity, making it hard to construct and evaluate the diverse policies. In this work, we summarize previous concepts of diversity and work towards offering a unified measure of diversity in multi-agent open-ended learning to include all elements in Markov games, based on both Behavioral Diversity (BD) and Response Diversity (RD).





A Algorithm

Neural Information Processing Systems

This section consists of three parts, with each subsequent part building upon the previous one. Appendix A.1 covers the fundamentals of RL, where the actor-critic method is introduced. Appendix A.2 describes the RL algorithm for a single fulfillment agent, which is the proximal policy Appendix A.3 presents the MARL algorithm for the Currently, policy-based methods [Deisenroth et al., 2013] are prevalent because they are compatible with stochastic To sum up, the complete procedure is given in Algorithm 1.Algorithm 1 Heterogeneous Multi-Agent Reinforcement Learning for Order Fulfillment. With regard to the advantage estimator, we set the GAE parameters [Schulman et al., 2016] To highlight how our proposed benchmark differs from existing approaches focused on sub-tasks of order fulfillment, we compare the objectives, observations, and actions in Table 1. It should be noted that multiple formulations exist for each sub-task.