Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

May-27-2025, 19:50:32 GMT–Neural Information Processing Systems

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained to represent human preferences and is in turn used by an online reinforcement learning (RL) algorithm to optimize the LLM. A prominent issue with such methods is reward over-optimization, or reward hacking, where performance as measured by the learned proxy reward model increases but true model quality plateaus or even deteriorates. Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO), have emerged as alternatives to the classical RLHF pipeline. However, despite not training a separate proxy reward model or using RL, they still commonly deteriorate from over-optimization.
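To make concrete how a DAA can sidestep a separate reward model, here is a minimal sketch of the per-example DPO objective in its commonly published form. It is an illustration, not code from the paper: the function names, the argument names, and the default `beta = 0.1` are assumptions, and the log-probabilities are assumed to be summed over the tokens of each response.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen: float,
             policy_logp_rejected: float,
             ref_logp_chosen: float,
             ref_logp_rejected: float,
             beta: float = 0.1):
    """Per-example DPO loss (hypothetical helper, illustrative only).

    Returns the loss together with the implicit rewards of the chosen
    and rejected responses.
    """
    # Implicit reward: scaled log-ratio of the policy to the frozen reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    loss = softplus(-margin)
    return loss, chosen_reward, rejected_reward


if __name__ == "__main__":
    # Policy identical to the reference: zero margin, loss is log(2), about 0.693.
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy shifted toward the chosen response: the loss drops below log(2).
    print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

The implicit reward here is the scaled log-ratio between the policy and the reference model, so the policy itself plays the role of the proxy reward. This is one way to see why over-optimization can still arise even though no separate reward model is trained.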

direct alignment algorithm, reward model overoptimization, scaling law, (2 more...)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.85)
  - Machine Learning
    - Neural Networks > Deep Learning (1.00)
    - Reinforcement Learning (0.85)