self-play
Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems
Building an open-domain conversational agent is a challenging problem. Current evaluation methods, mostly post-hoc judgments of static conversations, do not capture conversation quality in a realistic interactive context. In this paper, we investigate interactive human evaluation and provide evidence for its necessity; we then introduce a novel, model-agnostic, and dataset-agnostic method to approximate it. In particular, we propose a self-play scenario in which the dialog system talks to itself, and we compute a combination of proxies such as sentiment and semantic coherence over the conversation trajectory. We show that this metric captures the human-rated quality of a dialog model better than any automated metric known to date, achieving a significant Pearson correlation (r > .7).
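The combination of proxies described above is easy to sketch. In the toy implementation below, the proxy functions and weights are illustrative stand-ins (a lexicon-based sentiment score and word-overlap coherence), not the paper's actual metrics, which would use trained classifiers and embedding similarity:

```python
# Toy sketch of the self-play evaluation idea: the bot talks to itself,
# then per-turn proxy metrics on the trajectory are combined into one score.
# All proxy functions and weights here are illustrative assumptions.

def sentiment_proxy(turn: str) -> float:
    # Stand-in lexicon-based sentiment in [0, 1]; a real system would use
    # a trained sentiment classifier.
    positive = {"great", "thanks", "love", "good", "nice"}
    negative = {"bad", "hate", "terrible", "boring"}
    words = turn.lower().split()
    if not words:
        return 0.5
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return 0.5 + 0.5 * max(-1.0, min(1.0, score / len(words)))

def coherence_proxy(prev: str, curr: str) -> float:
    # Stand-in semantic coherence: word-overlap Jaccard between adjacent
    # turns; an embedding-based similarity would be used in practice.
    a, b = set(prev.lower().split()), set(curr.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def self_play_score(trajectory: list[str], w_sent: float = 0.5, w_coh: float = 0.5) -> float:
    """Weighted combination of per-turn proxies over a self-play dialog."""
    sent = sum(sentiment_proxy(t) for t in trajectory) / len(trajectory)
    pairs = list(zip(trajectory, trajectory[1:]))
    coh = sum(coherence_proxy(p, c) for p, c in pairs) / max(1, len(pairs))
    return w_sent * sent + w_coh * coh

dialog = ["hi there", "hi how are you", "i am good thanks", "that is great to hear"]
print(round(self_play_score(dialog), 3))
```

Because the score is computed on whole self-play trajectories rather than isolated responses, it can penalize repetitive or drifting conversations that single-response judgments miss.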
Guarantees for Self-Play in Multiplayer Games via Polymatrix Decomposability
Self-play is a technique for machine learning in multi-agent systems where a learning algorithm learns by interacting with copies of itself. Self-play is useful for generating large quantities of data for learning, but has the drawback that the agents the learner will face post-training may behave dramatically differently from what the learner came to expect from interacting with itself. For the special case of two-player constant-sum games, self-play that reaches Nash equilibrium is guaranteed to produce strategies that perform well against any post-training opponent; however, no such guarantee exists for multiplayer games. We show that in games that approximately decompose into a set of two-player constant-sum games (called constant-sum polymatrix games) where global $\epsilon$-Nash equilibria are boundedly far from Nash equilibria in each subgame (called subgame stability), any no-external-regret algorithm that learns by self-play will produce a strategy with bounded vulnerability. For the first time, our results identify a structural property of multiplayer games that enables performance guarantees for the strategies produced by a broad class of self-play algorithms. We demonstrate our findings through experiments on Leduc poker.
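A minimal illustration of the setting this guarantee speaks to (not the paper's construction): two copies of a no-external-regret learner, Hedge / multiplicative weights, in self-play on a two-player zero-sum game. The time-averaged strategies then form an approximate Nash equilibrium; matching pennies, with its unique equilibrium at (1/2, 1/2), is chosen purely for illustration:

```python
import math

# Two copies of Hedge (a no-external-regret algorithm) in self-play on
# matching pennies. In a two-player zero-sum game, the time-averaged
# strategies of no-regret self-play approach a Nash equilibrium; the
# game and step size here are illustrative choices.

A = [[1.0, -1.0],   # row player's payoff; the column player gets -A
     [-1.0, 1.0]]

def hedge(weights, payoffs, eta):
    """One multiplicative-weights update, renormalized for stability."""
    new = [w * math.exp(eta * u) for w, u in zip(weights, payoffs)]
    z = sum(new)
    return [w / z for w in new]

T = 5000
eta = math.sqrt(math.log(2) / T)      # standard Hedge step size
p, q = [0.3, 0.7], [0.7, 0.3]         # deliberately off-equilibrium start
avg_p, avg_q = [0.0, 0.0], [0.0, 0.0]

for _ in range(T):
    avg_p = [a + x / T for a, x in zip(avg_p, p)]
    avg_q = [a + x / T for a, x in zip(avg_q, q)]
    # Expected payoff of each pure action against the opponent's current mix.
    row_pay = [sum(A[i][j] * q[j] for j in range(2)) for i in range(2)]
    col_pay = [-sum(A[i][j] * p[i] for i in range(2)) for j in range(2)]
    p, q = hedge(p, row_pay, eta), hedge(q, col_pay, eta)

print([round(x, 2) for x in avg_p], [round(x, 2) for x in avg_q])
```

Both averages end up close to the uniform equilibrium. The paper's contribution is identifying when a guarantee of this flavor survives beyond the two-player constant-sum case.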
Near-Optimal Reinforcement Learning with Self-Play
This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions, and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\tilde{O}(S^2AB)$ steps of game playing, highlighting only the dependence on $(S, A, B)$. In contrast, the best existing lower bound scales as $\Omega(S(A+B))$, leaving a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{O}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{O}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the episode length. In addition, we present a computational hardness result for learning the best response against a fixed opponent in Markov games, a learning objective different from finding the Nash equilibrium.
Towards Understanding Self-play for LLM Reasoning
Chae, Justin Yang, Alam, Md Tanvirul, Rastogi, Nidhi
Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.
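The pass@k evaluations mentioned above are commonly computed with the unbiased estimator from the code-generation literature; assuming that convention (the paper may use a variant), given n sampled solutions of which c are correct:

```python
from math import comb

# Standard unbiased pass@k estimator: the probability that at least one
# of k samples, drawn without replacement from n attempts containing
# c correct ones, is correct.

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))
```

For k = 1 this reduces to the raw fraction of correct samples (0.3 here); larger k probes whether a model's distribution contains any correct reasoning path, which is why pass@k curves are used to separate genuine capability gains from mere sharpening of the output distribution.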
SPICE: Self-Play In Corpus Environments Improves Reasoning
Liu, Bo, Jin, Chuanyang, Kim, Seungone, Yuan, Weizhe, Zhao, Wenting, Kulikov, Ilia, Li, Xian, Sukhbaatar, Sainbayar, Lanchantin, Jack, Weston, Jason
Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods, which offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals that document grounding is the key ingredient that lets SPICE continuously generate, and then achieve, increasingly challenging goals, enabling sustained self-improvement.
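The Challenger/Reasoner loop can be sketched with trivial stand-ins for the model's two roles. Everything below (the corpus, the task format, the reward) is an illustrative assumption; only the control flow of corpus-grounded self-play with a verifiable reward reflects the described framework:

```python
import random

# Toy sketch of a SPICE-style loop: one "model" plays a Challenger that
# mines tasks from a corpus and a Reasoner that solves them. The LLM is
# replaced by trivial stand-in functions; in the real framework both
# roles are the same trained model and rewards drive RL updates.

CORPUS = [
    "The bridge spans 1280 meters across the strait.",
    "The reactor produces 950 megawatts at full load.",
]

def challenger_propose(doc: str) -> tuple[str, str]:
    # Mine a grounded question plus its answer key from the document
    # (stand-in: ask for the number the document mentions).
    number = next(tok for tok in doc.split() if tok.isdigit())
    return f"What number appears in: '{doc}'?", number

def reasoner_answer(question: str) -> str:
    # Stand-in solver: extract the first number from the quoted document.
    return next(tok for tok in question.split() if tok.isdigit())

def self_play_round(rng: random.Random) -> float:
    doc = rng.choice(CORPUS)
    question, gold = challenger_propose(doc)
    answer = reasoner_answer(question)
    # Verifiable reward: corpus grounding supplies the answer key for free.
    return 1.0 if answer == gold else 0.0

rng = random.Random(0)
rewards = [self_play_round(rng) for _ in range(10)]
print(sum(rewards) / len(rewards))
```

The point of the grounding is visible even in this toy: because tasks are mined from documents, the answer key comes from the corpus rather than from the model itself, giving an external signal that ungrounded self-play lacks.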
Reviews: Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems
The paper attempts to move away from traditional evaluation of open-domain dialog systems (i.e., judging a response given its conversation history) toward a more interactive one (i.e., a human talking to a bot), which is likely an important step towards better evaluation. However, I do have several serious concerns about this work in its current form: (1) The authors contrast their work with existing evaluation for open-domain dialog, which they call "single-turn" evaluation. They point out that this type of evaluation fails to capture "failure modes […] such as a lack of diversity in the responses, inability to track long-term aspects of the conversation". I think this is rather misleading and the term "single-turn" is a misnomer. Most previous work has indeed evaluated each conversation by factorizing it into a sequence of independent turn-level judgments, but each of these judgments assesses the quality of the current turn T_n **given** a history of several previous turns …, T_{n-k}, …, T_{n-1}.
Reviews: Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems
This paper explores interesting directions, in particular 1) using interactive settings to evaluate a model rather than a single answer, and 2) combining different automated metrics in a weighted sum to approximate human evaluation (e.g., based on sentiment). Reviewers have raised crucial points regarding gameability (using the metrics to train a model is tricky if not followed by a non-gameable evaluation) and the lack of comparability between different self-play setups. It is indeed a much better evaluation setting if the system does not control both sides (e.g., models being matched against the same set of fixed models), so the authors should definitely pursue that direction. However, I expect this work would still be interesting to the dialog community: many of the diagnostic advantages of the model-talking-to-model setting remain in practice, especially because the model is not trained with the self-play objective; the criterion is only applied post hoc, so the system cannot extensively exploit it during training. In practice, many of a given model's generation problems already show up during self-play, and the reasonable worry raised by reviewers that the model could exploit the metric remains theoretical at the moment.
Differentially Private Reinforcement Learning with Self-Play
We study the problem of multi-agent reinforcement learning (multi-agent RL) with differential privacy (DP) constraints. This is well-motivated by various real-world applications involving sensitive data, where it is critical to protect users' private information. We first extend the definitions of Joint DP (JDP) and Local DP (LDP) to two-player zero-sum episodic Markov Games, where both definitions ensure trajectory-wise privacy protection. Then we design a provably efficient algorithm based on optimistic Nash value iteration and privatization of Bernstein-type bonuses. The algorithm is able to satisfy JDP and LDP requirements when instantiated with appropriate privacy mechanisms.
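As a flavor of the privatization step, here is a minimal Laplace-mechanism sketch for noising visit counts, the kind of primitive from which private exploration bonuses are built. The sensitivity and epsilon choices are assumptions for illustration, not the paper's exact mechanism:

```python
import math
import random

# Illustrative Laplace mechanism (not the paper's algorithm): adding
# calibrated noise to a visit count so that releasing it satisfies
# epsilon-differential privacy for that statistic.

def laplace_noise(scale: float, rng: random.Random) -> float:
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def privatize_count(count: int, epsilon: float, rng: random.Random) -> float:
    # One trajectory changes a visit count by at most 1, so the
    # sensitivity is 1 and the Laplace scale is 1 / epsilon (assumed
    # setting; the paper privatizes Bernstein-type bonuses instead).
    return count + laplace_noise(1.0 / epsilon, rng)

noisy = privatize_count(42, 1.0, random.Random(0))
print(noisy)
```

Smaller epsilon means larger noise and stronger privacy; the algorithmic challenge the abstract refers to is keeping the learning guarantees while all counts and bonuses are only ever accessed through such privatized versions.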
Robust Autonomy Emerges from Self-Play
Cusumano-Towner, Marco, Hafner, David, Hertzberg, Alex, Huval, Brody, Petrenko, Aleksei, Vinitsky, Eugene, Wijmans, Erik, Killian, Taylor, Bowers, Stuart, Sener, Ozan, Krähenbühl, Philipp, Koltun, Vladlen
Self-play has powered breakthroughs in two-player and multi-player games. Here we show that self-play is a surprisingly effective strategy in another domain. We show that robust and naturalistic driving emerges entirely from self-play in simulation at unprecedented scale -- 1.6 billion km of driving. This is enabled by Gigaflow, a batched simulator that can synthesize and train on 42 years of subjective driving experience per hour on a single 8-GPU node. The resulting policy achieves state-of-the-art performance on three independent autonomous driving benchmarks. The policy outperforms the prior state of the art when tested on recorded real-world scenarios, amidst human drivers, without ever seeing human data during training. The policy is realistic when assessed against human references and achieves unprecedented robustness, averaging 17.5 years of continuous driving between incidents in simulation.
Review for NeurIPS paper: Near-Optimal Reinforcement Learning with Self-Play
Additional Feedback: *) Is there a reason to mention Algorithm 1? It seems Algorithm 2 gives improved performance relative to it; if so, why present both algorithms rather than just Algorithm 2? *) Although Equation 9 can be thought of as a set of n·m linear constraints, why is the optimization problem always feasible? Although the authors devote half a page to explaining this procedure, I feel it is not well explained; most of the discussion is not devoted to explaining the policy certification procedure. Why, for a fixed \mu, is the best response not Markovian?