Towards Efficient Online Exploration for Reinforcement Learning with Human Feedback

Sep-29-2025–arXiv.org Machine Learning

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language tasks, yet aligning their behavior with human preferences remains a central challenge. A widely adopted solution is reinforcement learning with human feedback (RLHF), which fine-tunes a pretrained LLM using human preference data (Bai et al., 2022; Christiano et al., 2017; Ziegler et al., 2019). The standard RLHF pipeline involves three stages: (i) supervised fine-tuning (SFT) on human-written demonstrations to produce a baseline model; (ii) training a reward model from human preference comparisons, typically under the Bradley-Terry model (Bradley and Terry, 1952); and (iii) optimizing the LLM with reinforcement learning against the learned reward. This framework has been instrumental in the success of instruction-following LLMs such as InstructGPT (Ouyang et al., 2022) and ChatGPT (OpenAI, 2023), enabling models to produce responses that are more helpful, safe, and aligned with human expectations. Despite this progress, most existing RLHF implementations are offline (Azar et al., 2024; Rafailov et al., 2024; Zhao et al., 2023): the preference data is collected once from static policies, and the reward model is trained on this fixed dataset (Ivison et al., 2023; Shi et al., 2025; Zhu et al., 2024). While effective, offline RLHF has an inherent limitation: it cannot adaptively explore the enormous space of natural language, which leads to inefficient use of expensive human feedback. In contrast, online RLHF offers a more powerful alternative: the policy iteratively collects new preference data, updates the reward model, and improves itself based on these updates (Chen et al., 2024; Dong et al., 2024; Feng et al., 2025; Guo et al., 2024; Rosset et al., 2024; Xiong et al., 2023).
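Stage (ii) of the pipeline above fits a reward model to pairwise comparisons. As a minimal sketch of what the Bradley-Terry objective might look like, the snippet below computes the preference probability and the per-pair negative log-likelihood from two scalar rewards. The function names `bt_preference_prob` and `bt_pair_loss` and the example reward values are hypothetical, chosen only for illustration. This is not the paper's implementation, and a real reward model would produce these scalars from a neural network.

```python
import math


def bt_preference_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the first response is preferred.

    Equals sigmoid(reward_chosen - reward_rejected), computed in a
    numerically stable way for large positive or negative differences.
    """
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def bt_pair_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of one comparison: -log sigmoid(diff).

    Written as softplus(-diff) so it stays finite even when the
    probability itself would underflow to zero.
    """
    diff = reward_chosen - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def bt_mean_loss(pairs) -> float:
    """Average loss over an iterable of (reward_chosen, reward_rejected)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    return sum(bt_pair_loss(rc, rr) for rc, rr in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical reward scores for three human-labelled comparisons.
    example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    print(bt_preference_prob(1.2, 0.3))
    print(bt_mean_loss(example_pairs))
```

Minimizing this loss over the reward model's parameters pushes the reward of the human-preferred response above that of the rejected one. In the online setting described above, the set of pairs would grow each round as the current policy collects new comparisons.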



Country:
- North America > United States
  - Wisconsin > Dane County > Madison (0.14)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Middle East > Jordan (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
