No-Press Diplomacy: Modeling Multi-Agent Gameplay
Diplomacy is a seven-player, non-stochastic, non-cooperative game in which agents acquire resources through a mix of teamwork and betrayal. Its reliance on trust and coordination makes Diplomacy the first non-cooperative multi-agent benchmark for complex sequential social dilemmas in a rich environment. In this work, we focus on training an agent that learns to play the No-Press version of Diplomacy, in which there is no dedicated communication channel between players.
Learning to Play No-Press Diplomacy with Best Response Policy Iteration
Recent advances in deep reinforcement learning (RL) have led to considerable progress in many 2-player zero-sum games, such as Go, Poker and StarCraft. The purely adversarial nature of such games allows for a conceptually simple and principled application of RL methods. However, real-world settings involve many agents, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, both of which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state of the art, and game-theoretic equilibrium analysis shows that the new process yields consistent improvements.
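The approximate-best-response plus fictitious-play idea in the abstract can be sketched on a toy simultaneous-move game. Everything below is an illustrative reconstruction, not the paper's algorithm: the game (rock-paper-scissors), the Monte Carlo scoring of candidate actions, and all parameter values are assumptions. In Diplomacy, candidate actions would instead be sampled from a learned policy over a combinatorial action space.

```python
import random
from collections import Counter

# Toy simultaneous-move zero-sum game: rock-paper-scissors.
ACTIONS = ["R", "P", "S"]
BEATS = {"R": "S", "P": "R", "S": "P"}  # key beats value

def payoff(a, b):
    """Payoff to the player choosing a against an opponent choosing b."""
    if a == b:
        return 0.0
    return 1.0 if BEATS[a] == b else -1.0

def sampled_best_response(opp_counts, n_rollouts=100):
    """Approximate a best response by Monte Carlo scoring of candidate
    actions against samples from the opponent's empirical average policy."""
    opp_actions = list(opp_counts.elements())
    best, best_val = None, float("-inf")
    for a in ACTIONS:  # tiny candidate set; sampled from a policy in Diplomacy
        val = sum(payoff(a, random.choice(opp_actions))
                  for _ in range(n_rollouts)) / n_rollouts
        if val > best_val:
            best, best_val = a, val
    return best

def fictitious_play(iters=1500, seed=0):
    """Each player repeatedly best-responds to the other's empirical
    average policy; the time-averaged play approximates an equilibrium."""
    random.seed(seed)
    counts = [Counter(ACTIONS), Counter(ACTIONS)]  # start from uniform
    for _ in range(iters):
        br = [sampled_best_response(counts[1]), sampled_best_response(counts[0])]
        for p in (0, 1):
            counts[p][br[p]] += 1
    total = sum(counts[0].values())
    return {a: counts[0][a] / total for a in ACTIONS}

avg = fictitious_play()  # empirical policy approaches the uniform equilibrium
```

In rock-paper-scissors the unique equilibrium is uniform play, so the averaged policy drifts toward roughly a third on each action; the paper's contribution is making the best-response step tractable when the action space is combinatorial rather than three actions.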
The age of unipolar diplomacy is coming to an end
What is a Palestinian without olives? In Gaza, the world has seen the cost of a diplomacy that claims to uphold a rules-based order but applies it selectively. The United States intervened late, and only to defend an occupation the International Court of Justice (ICJ) has ruled illegal. Alongside other Western nations that built multilateral institutions, the US increasingly pursues nationalist agendas that undermine them. The hypocrisy is stark: one set of rules for Ukraine, another for Gaza.
- North America > United States (0.91)
- Asia > Middle East > Palestine > Gaza Strip > Gaza Governorate > Gaza (0.52)
- Europe > Ukraine (0.25)
- (11 more...)
- Government (1.00)
- Law > International Law (0.90)
Richelieu: Self-Evolving LLM-Based Agents for AI Diplomacy
Diplomacy is one of the most sophisticated activities in human society, involving complex interactions among multiple parties that require skills in social reasoning, negotiation, and long-term strategic planning. Previous AI agents have demonstrated their ability to handle multi-step games and large action spaces in multi-agent tasks. However, diplomacy involves a decision space of staggering magnitude, especially once the negotiation stage is taken into account. While recent agents based on large language models (LLMs) have shown potential in various applications, they still struggle with extended planning horizons in complex multi-agent settings. Leveraging recent techniques for LLM-based agents, we aim to explore AI's potential to create a human-like agent capable of executing comprehensive multi-agent missions by integrating three fundamental capabilities: 1) strategic planning with memory and reflection; 2) goal-oriented negotiation with social reasoning; and 3) augmenting memory through self-play games for self-evolution without a human in the loop.
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Government (1.00)
- Leisure & Entertainment > Games > Computer Games (0.46)
- North America > Canada > Quebec > Montreal (0.14)
- Europe > United Kingdom > England (0.04)
- Europe > Russia (0.04)
- (9 more...)
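The three capabilities the Richelieu abstract enumerates can be pictured as a minimal agent loop. The class below is a hypothetical skeleton, not the paper's implementation: all names (`Agent`, `plan`, `negotiate`, `reflect`, `self_play`) are invented for illustration, and the LLM call is stubbed out with a placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    memory: list = field(default_factory=list)

    def llm(self, prompt):
        # Stand-in for a real LLM call; returns a canned string here.
        return f"response[{prompt[:24]}]"

    def plan(self, state):
        # 1) Strategic planning conditioned on memory and past reflections.
        context = "; ".join(self.memory[-3:])
        return self.llm(f"plan for {state} given {context}")

    def negotiate(self, state, plan):
        # 2) Goal-oriented negotiation in support of the current plan.
        return self.llm(f"message in {state} supporting {plan}")

    def reflect(self, state, plan, outcome):
        # Distil the outcome into a reusable memory entry.
        self.memory.append(self.llm(f"lesson: {plan} led to {outcome}"))

def self_play(agents, episodes=2):
    # 3) Self-evolution: agents play each other and grow their memories
    # without a human in the loop.
    for ep in range(episodes):
        state = f"turn-{ep}"
        for ag in agents:
            plan = ag.plan(state)
            ag.negotiate(state, plan)
            ag.reflect(state, plan, outcome="draw")
    return agents

agents = self_play([Agent(), Agent()])
```

The point of the skeleton is the data flow: reflections written during self-play feed back into later planning prompts, which is the "self-evolution" loop the abstract describes.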
dataset release, tournament evaluation, architectural design, input representation, and other insights
We want to thank the reviewers for their helpful comments. The dataset will be made available to any interested researchers. We agree with R3 that there are many non-trivial modeling choices in our architecture. We call the first one unit-based and the second token-based. We apologize for stating some of the claims without referring to the evidence, such as "orders from the last movement". Our input representation is the result of both empirical findings and domain knowledge.
7 Checklist
For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We release the code and the models. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] We included the instructions given to participants in Appendix F.

In this appendix, we describe the neural network architecture used for our agents.

Figure 2: Transformer encoder (left) used in both the policy proposal network (center) and the value network (right).

Our model architecture is shown in Figure 2. It is essentially identical to the architecture in [11], except that it replaces the specialized graph-convolution-based encoder with a much simpler transformer encoder, removes all dropout layers, and uses separate policy and value networks. Aside from the encoder, the other aspects of the architecture are the same, notably the LSTM policy decoder, which decodes orders through sequential attention over each successive location in the encoder output to produce an action. The input to our new encoder is also identical to that of [11], consisting of the same representation of the current board state, the previous board state, and a recent-order embedding. Rather than processing various parts of this input in two parallel trunks before combining them into a shared encoder trunk, we take the simpler approach of concatenating all features together at the start, resulting in 146 feature channels across each of 81 board locations (75 regions + 6 coasts). We pass this through a linear layer, add pointwise a learnable per-position, per-channel bias, and then pass the result to a standard transformer encoder.
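Using the numbers stated above (146 feature channels across 81 board locations), the input path of concatenation, linear projection, and pointwise per-position, per-channel bias can be sketched at the shape level. This is an illustration under assumptions: the split of the 146 channels across the three inputs and the model width `D_MODEL` are not given in the text, and the transformer encoder itself is omitted.

```python
import numpy as np

N_LOCS, N_FEATS, D_MODEL = 81, 146, 224   # 81 = 75 regions + 6 coasts;
                                          # D_MODEL is an assumed width

rng = np.random.default_rng(0)
board_state      = rng.random((N_LOCS, 100))  # illustrative split of the
prev_board_state = rng.random((N_LOCS, 26))   # 146 channels; the true
order_embedding  = rng.random((N_LOCS, 20))   # breakdown is not stated

# Concatenate all features per board location instead of using two trunks.
x = np.concatenate([board_state, prev_board_state, order_embedding], axis=-1)
assert x.shape == (N_LOCS, N_FEATS)

W = rng.standard_normal((N_FEATS, D_MODEL)) * N_FEATS ** -0.5  # linear layer
pos_bias = np.zeros((N_LOCS, D_MODEL))  # learnable per-position, per-channel bias

h = x @ W + pos_bias  # (81, D_MODEL); this would feed a transformer encoder
```

Because the bias is indexed by board location as well as channel, it plays the role of a learned positional encoding, which is why no separate positional-encoding scheme is needed before the transformer.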