Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jun-14-2026, 07:02:40 GMT–Neural Information Processing Systems

We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale, reasoning-oriented RL training on the base model, with a focus on scalability, simplicity, and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, namely vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and GPQA Diamond while being markedly more efficient: it requires only 1/10 of the training steps used by the DeepSeek-R1-Zero pipeline.
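To make the abstract's recipe concrete, the sketch below shows what GAE with $\lambda=1$, $\gamma=1$ computes. With both set to 1, the TD residuals $\delta_t = r_t + V(s_{t+1}) - V(s_t)$ telescope, so the advantage becomes $A_t = \sum_{k \ge t} r_k - V(s_t)$: the undiscounted Monte Carlo return minus the critic's baseline. This is a minimal illustration under stated assumptions, not the authors' implementation. The reward function `rule_based_reward` is a hypothetical stand-in for "straightforward rule-based rewards" (it assumes the final answer is wrapped in `\boxed{...}` with no nested braces and gives 1.0 for an exact match, else 0.0), and the critic values are made-up numbers. Following the abstract, no KL penalty is added to the reward.

```python
from typing import List, Tuple


def rule_based_reward(response: str, reference_answer: str) -> float:
    # Hypothetical binary reward: 1.0 if the content of the last "\boxed{...}"
    # in the response matches the reference answer exactly, else 0.0.
    # Assumption: answers are wrapped in \boxed{} and contain no nested braces.
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return 0.0
    start += len(marker)
    end = response.find("}", start)
    if end == -1:
        return 0.0
    return 1.0 if response[start:end].strip() == reference_answer.strip() else 0.0


def gae(rewards: List[float], values: List[float],
        gamma: float = 1.0, lam: float = 1.0) -> Tuple[List[float], List[float]]:
    # Generalized Advantage Estimation for one finished trajectory.
    # values[t] is the critic's estimate V(s_t); the state after the final
    # token is terminal, so its value is taken as 0.
    assert len(rewards) == len(values)
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    # Value-function targets: advantage plus the baseline it was measured against.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    response = "... so the answer is \\boxed{42}"
    n_tokens = 4  # toy trajectory length
    # Sparse reward on the last token only; no KL penalty is folded in.
    rewards = [0.0] * (n_tokens - 1) + [rule_based_reward(response, "42")]
    values = [0.5, 0.6, 0.7, 0.8]  # made-up critic outputs for illustration
    adv, ret = gae(rewards, values)
    print([round(a, 3) for a in adv])  # [0.5, 0.4, 0.3, 0.2]
    print([round(r, 3) for r in ret])  # [1.0, 1.0, 1.0, 1.0]
```

In the toy run, every advantage equals the single terminal reward (1.0) minus the value estimate at that position, and every return target is 1.0. This is the telescoping identity above. Setting `gamma` or `lam` below 1 would instead down-weight credit assigned to tokens far from the final reward.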

large language model, machine learning, natural language
