Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs
Yong Lin
Reward models trained on human preference data have proven effective at aligning Large Language Models (LLMs) with human intent within the framework of reinforcement learning from human feedback (RLHF). However, current reward models generalize poorly to unseen prompts and responses, which can lead to an unexpected phenomenon known as reward over-optimization, where excessive optimization of the reward degrades actual performance. While previous research has advocated constraining policy optimization, our study introduces a novel approach that enhances the reward model's generalization under distribution shift by regularizing the hidden states. Specifically, we retain the base model's language model head and incorporate a suite of text-generation losses to preserve the hidden states' text-generation capabilities, while concurrently learning a reward head on top of the same hidden states.
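The following is a minimal sketch of the idea described in the abstract, not the authors' implementation: a shared backbone whose hidden states feed both the retained language-model head and a new scalar reward head, trained with a pairwise preference loss plus a next-token regularizer. The backbone interface, the specific text-generation loss (cross-entropy on the chosen response), the Bradley-Terry form of the preference loss, and the weight `lm_coef` are illustrative assumptions.

```python
# Sketch only: assumes a backbone module mapping (input_ids, attention_mask)
# to hidden states of shape (batch, seq_len, hidden_size).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegularizedRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, lm_head: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone          # shared transformer producing hidden states
        self.lm_head = lm_head            # retained language-model head of the base model
        self.reward_head = nn.Linear(hidden_size, 1)  # new scalar reward head

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)    # (batch, seq, hidden)
        lm_logits = self.lm_head(hidden)                     # text-generation logits
        last_idx = attention_mask.sum(dim=1) - 1             # last non-padding position
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        reward = self.reward_head(last_hidden).squeeze(-1)   # scalar reward per sequence
        return reward, lm_logits


def training_loss(model, chosen, rejected, lm_coef=0.1):
    """Pairwise preference loss plus a text-generation regularizer
    that keeps the shared hidden states usable by the LM head."""
    r_chosen, lm_logits = model(chosen["input_ids"], chosen["attention_mask"])
    r_rejected, _ = model(rejected["input_ids"], rejected["attention_mask"])

    # Standard Bradley-Terry preference loss: chosen should outscore rejected.
    reward_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Next-token prediction on the chosen response regularizes the hidden states.
    shift_logits = lm_logits[:, :-1, :]
    shift_labels = chosen["input_ids"][:, 1:]
    lm_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
    return reward_loss + lm_coef * lm_loss
```

In this sketch the reward head and the LM head share every parameter of the backbone, so the text-generation loss acts directly as a regularizer on the representations the reward is computed from.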
Neural Information Processing Systems
Mar-22-2025, 13:00:49 GMT