A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong
We investigate reinforcement learning (RL) algorithms in the context of fine-tuning large language models (LLMs) with verifiable rewards. Our focus is on mathematical reasoning tasks, which have recently received significant attention following the release of models such as OpenAI's o1 (Jaech et al., 2024) and DeepSeek-R1 (DeepSeek-AI et al., 2025). The dominant approach in LLM post-training has been Proximal Policy Optimization (PPO) (Schulman et al., 2017; Bai et al., 2022; Ouyang et al., 2022). However, PPO requires a critic network beyond what the vanilla Reinforce algorithm (Williams and Peng, 1991) needs, introducing both computational overhead and algorithmic complexity. Meanwhile, because transitions in LLM generation are deterministic, the problem exhibits relatively low variance, and many of PPO's sophisticated components may be unnecessary in this setting. This observation has inspired growing interest in designing simpler yet effective RL algorithms for post-training LLMs. Several recent works revisit Reinforce-style approaches, including ReMax (Li et al., 2023), RLOO (Ahmadian et al., 2024; Kool et al., 2019), GRPO (Shao et al., 2024), and Reinforce++ (Hu, 2025). In parallel, other methods explore directions beyond policy gradients. Reward-ranked fine-tuning (RAFT) (Anthony et al., 2017; Dong et al., 2023) iteratively generates n responses per prompt, filters out those with incorrect answers, and fine-tunes the LLM on the remaining accepted samples.
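To make the RAFT procedure described above concrete, here is a minimal sketch of one sample-filter-fine-tune round. The callables `generate`, `verify`, and `fine_tune` are hypothetical placeholders standing in for a real LLM stack; only the control flow mirrors the paper's description.

```python
from typing import Callable, List, Tuple

def raft_iteration(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],    # samples n responses per prompt
    verify: Callable[[str, str], bool],           # verifiable reward: is the answer correct?
    fine_tune: Callable[[List[Tuple[str, str]]], None],
    n: int = 8,
) -> None:
    """One RAFT round: sample n responses per prompt, reject incorrect ones,
    then fine-tune on the accepted (prompt, response) pairs."""
    accepted: List[Tuple[str, str]] = []
    for prompt in prompts:
        for response in generate(prompt, n):
            if verify(prompt, response):          # rejection sampling: keep only verified-correct samples
                accepted.append((prompt, response))
    if accepted:                                  # standard supervised fine-tuning on the survivors
        fine_tune(accepted)
```

In this sketch, `verify` plays the role of the verifiable reward (e.g., exact-match answer checking on math problems), and repeating the round yields the iterative procedure the abstract describes.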
Apr-15-2025