SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning
Salman Rahman, Sruthi Gorantla, Arpit Gupta, Swastik Roy, Nanyun Peng, Yang Liu
arXiv.org Artificial Intelligence, Dec-4-2025
Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning (RL), yet their adoption remains limited by the need for expensive step-level annotations or ground-truth references. SPARK addresses this with a three-stage, reference-free pipeline. In the first stage, LLM verifiers produce step-level verification outputs for model-generated solutions without access to reference answers. In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpasses ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) versus 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.

Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, from achieving gold-medal performance at the International Mathematical Olympiad to autonomous agentic coding (Castelvecchi, 2025; Luong & Lockhart, 2025; Yang et al., 2024b; Hurst et al., 2024; Anthropic, 2025). Recent breakthroughs like OpenAI's o1 and DeepSeek's R1 demonstrate that RL post-training can significantly enhance reasoning capabilities beyond supervised fine-tuning alone (Jaech et al., 2024; Guo et al., 2025), as RL enables models to explore diverse solution paths and learn from feedback rather than imitation (Chu et al., 2025).

While RL post-training shows promise, current approaches rely on verifiers that require ground-truth references: either discriminative verifiers that provide binary correctness signals (Cobbe et al., 2021) or rule-based verifiers using exact answer matching (RLVR) (Guo et al., 2025; Hu et al., 2025), both offering only sparse, outcome-level rewards. Recent advances introduce PRMs that provide denser, step-level feedback to improve training stability and credit assignment (Lightman et al., 2023; Wang et al., 2024; Uesato et al., 2022), including co-evolving approaches like TANGO (Zha et al., 2025) and PRIME (Yuan et al., 2024) that jointly train the verifier alongside the policy model; notably, PRIME still requires outcome-level correctness labels to train its PRM (Zha et al., 2025; Yuan et al., 2024).

(Work done while an intern at Amazon AGI.)

Figure: in Stage III, the trained PRMs are applied in RL with GRPO using different reward designs.
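The step-level aggregation behind the synthetic PRM training data can be made concrete. The sketch below is a minimal illustration rather than the paper's implementation: it assumes each independent verification run yields one correct/incorrect judgment per solution step, and majority-votes across runs to produce the synthetic step labels.

```python
from collections import Counter

def aggregate_step_labels(verifications: list[list[bool]]) -> list[bool]:
    """Majority-vote per-step judgments from several independent
    verification runs into one synthetic step-label sequence.

    verifications: one list of per-step booleans per verifier run,
    all covering the same solution (equal step counts).
    """
    num_steps = len(verifications[0])
    labels = []
    for step in range(num_steps):
        votes = Counter(run[step] for run in verifications)
        labels.append(votes[True] >= votes[False])  # tie breaks toward correct
    return labels

# Three independent verifications of a four-step solution.
runs = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
print(aggregate_step_labels(runs))  # [True, True, False, False]
```

Averaging soft scores instead of hard voting is an equally plausible variant; the abstract specifies only that multiple independent verifications are aggregated per step.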
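The abstract mentions format constraints that prevent reward hacking but does not spell them out. One plausible reading, sketched here with hypothetical patterns (numbered steps plus a final \boxed{} answer), is to gate the PRM's score behind a format check so degenerate or malformed policy outputs earn no reward.

```python
import re

# Hypothetical format gate: the policy must show numbered steps and a
# final boxed answer; anything else scores zero regardless of the PRM,
# removing one easy avenue for reward hacking.
STEP_RE = re.compile(r"^Step \d+:", re.MULTILINE)
ANSWER_RE = re.compile(r"\\boxed\{.+?\}")

def shaped_reward(response: str, prm_score: float) -> float:
    """Return the PRM score only for well-formed responses."""
    well_formed = bool(STEP_RE.search(response)) and bool(ANSWER_RE.search(response))
    return prm_score if well_formed else 0.0

resp = "Step 1: factor.\nStep 2: solve.\nAnswer: \\boxed{42}"
print(shaped_reward(resp, prm_score=0.8))  # 0.8
print(shaped_reward("42", prm_score=0.8))  # 0.0
```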
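Stage III trains with GRPO, which dispenses with a learned value function: each prompt's sampled responses form a group, and advantages come from normalizing rewards within that group. A standard, SPARK-agnostic sketch of the computation:

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: standardize each response's reward by
    the mean and std of rewards within the same prompt's group."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Rewards for four responses sampled from one prompt.
print(grpo_advantages(np.array([0.0, 1.0, 1.0, 0.0])))  # approx. [-1.  1.  1. -1.]
```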
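For contrast with the dense step-level signal above, the sparse RLVR-style baseline reduces to exact answer matching against a ground-truth reference. A schematic version follows, where the \boxed{} extraction pattern is an assumption rather than a detail from the paper.

```python
import re

def rlvr_reward(response: str, reference_answer: str) -> float:
    """Sparse outcome reward: 1.0 iff the final boxed answer exactly
    matches the ground-truth reference, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

print(rlvr_reward("Answer: \\boxed{42}", "42"))  # 1.0
print(rlvr_reward("Answer: 42", "42"))           # 0.0
```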