Bridging Offline and Online Reinforcement Learning for LLMs
Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E. Weston, Sainbayar Sukhbaatar, Ilia Kulikov
arXiv.org Artificial Intelligence
We investigate the effectiveness of reinforcement learning methods for fine-tuning large language models when transitioning from offline to semi-online to fully online regimes, for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math problems as well as non-verifiable instruction following, with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) objectives, and surprisingly find similar performance and convergence across these variants, all of which strongly outperform offline methods. We provide a detailed analysis of the training dynamics and of the hyperparameter selection strategies needed to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly improves performance on both task types.
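For reference, the two objectives being compared have standard published forms (the paper's exact variants may differ). DPO (Rafailov et al., 2023) optimizes a contrastive loss over chosen/rejected response pairs against a frozen reference policy:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y^{+}\mid x)}{\pi_{\mathrm{ref}}(y^{+}\mid x)} - \beta \log \frac{\pi_\theta(y^{-}\mid x)}{\pi_{\mathrm{ref}}(y^{-}\mid x)}\right)\right]
$$

GRPO (Shao et al., 2024) instead samples a group of $G$ responses per prompt and weights each by a group-normalized advantage computed from its reward $r_i$:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}
$$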
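The offline-to-online spectrum the abstract refers to can be pictured as a single synchronization knob: how often the policy that generates training responses is re-synced to the policy being updated. The sketch below is a hypothetical illustration, not the authors' implementation; generate_responses, train_step, and sync_interval are placeholder names standing in for rollout sampling, a DPO/GRPO update, and the sync period.

```python
import copy


def generate_responses(policy, prompts):
    """Placeholder: sample responses from a (possibly stale) policy snapshot."""
    return [(p, f"sample@step{policy['steps']}") for p in prompts]


def train_step(policy, batch):
    """Placeholder: one DPO or GRPO gradient step on the current batch."""
    policy["steps"] += 1


def train(prompts, total_steps, sync_interval):
    """sync_interval positions training on the offline-to-online spectrum:
      - sync_interval >= total_steps: offline (data generated once, up front)
      - 1 < sync_interval < total_steps: semi-online (periodic re-generation)
      - sync_interval == 1: fully online (fresh samples every step)
    """
    policy = {"steps": 0}
    snapshot = copy.deepcopy(policy)  # generation policy starts in sync
    batch = generate_responses(snapshot, prompts)
    for step in range(total_steps):
        if step > 0 and step % sync_interval == 0:
            snapshot = copy.deepcopy(policy)  # re-sync generator weights
            batch = generate_responses(snapshot, prompts)
        train_step(policy, batch)
    return policy


if __name__ == "__main__":
    train(["prompt"], total_steps=100, sync_interval=10)  # a semi-online run
```

Under this framing, offline DPO and fully online GRPO sit at opposite ends of one axis, which is what makes the reported similarity between semi-online and online variants notable.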
Jun-27-2025