DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Neural Information Processing Systems
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (for example, in the OpenAI o1 blog post and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B.
Jun-21-2026, 04:16:33 GMT