A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong
We investigate reinforcement learning (RL) algorithms in the context of fine-tuning large language models (LLMs) with verifiable rewards. Our focus is on mathematical reasoning tasks, which have recently received significant attention following the release of models such as OpenAI's o1 (Jaech et al., 2024) and DeepSeek-R1 (DeepSeek-AI et al., 2025). The dominant approach in LLM post-training has been Proximal Policy Optimization (PPO) (Schulman et al., 2017; Bai et al., 2022; Ouyang et al., 2022). However, PPO requires a critic network beyond what the vanilla Reinforce algorithm (Williams and Peng, 1991) needs, introducing both computational overhead and algorithmic complexity. Meanwhile, because transitions in LLM generation are deterministic, the problem exhibits relatively low variance, and many of PPO's sophisticated components may be unnecessary in this setting. This observation has inspired growing interest in designing simpler yet effective RL algorithms for post-training LLMs. Several recent works revisit Reinforce-style approaches, including ReMax (Li et al., 2023), RLOO (Ahmadian et al., 2024; Kool et al., 2019), GRPO (Shao et al., 2024), and Reinforce++ (Hu, 2025). In parallel, other methods explore directions beyond policy gradients. Reward-ranked fine-tuning (RAFT) (Anthony et al., 2017; Dong et al., 2023) iteratively generates n responses per prompt, filters out those with incorrect answers, and fine-tunes the LLM on the remaining accepted samples.
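To make the RAFT procedure described above concrete, here is a minimal sketch of one sample-filter-fine-tune round. The callables `generate`, `verify`, and `fine_tune` are hypothetical placeholders standing in for a real LLM stack; only the control flow mirrors the paper's description.

```python
from typing import Callable, List, Tuple

def raft_iteration(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],    # samples n responses per prompt
    verify: Callable[[str, str], bool],           # verifiable reward: is the answer correct?
    fine_tune: Callable[[List[Tuple[str, str]]], None],
    n: int = 8,
) -> None:
    """One RAFT round: sample n responses per prompt, reject incorrect ones,
    then fine-tune on the accepted (prompt, response) pairs."""
    accepted: List[Tuple[str, str]] = []
    for prompt in prompts:
        for response in generate(prompt, n):
            if verify(prompt, response):          # rejection sampling: keep only verified-correct samples
                accepted.append((prompt, response))
    if accepted:                                  # standard supervised fine-tuning on the survivors
        fine_tune(accepted)
```

In this sketch, `verify` plays the role of the verifiable reward (e.g., exact-match answer checking on math problems), and repeating the round yields the iterative procedure the abstract describes.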
Apr-15-2025