Towards Understanding Self-play for LLM Reasoning
Chae, Justin Yang; Alam, Md Tanvirul; Rastogi, Nidhi
arXiv.org Artificial Intelligence
Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.
Nov-3-2025
- Country:
  - Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
  - North America > United States > New York > Monroe County > Rochester (0.04)
  - North America > United States > Washington > King County > Seattle (0.04)
- Genre:
  - Research Report > New Finding (0.88)
- Technology: