Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs
Yong Lin
Reward models trained on human preference data have proven effective at aligning Large Language Models (LLMs) with human intent within the framework of reinforcement learning from human feedback (RLHF). However, current reward models generalize poorly to unseen prompts and responses, which can lead to an unexpected phenomenon known as reward over-optimization, where excessive optimization of the reward degrades actual performance. While previous research has advocated constraining policy optimization, our study introduces a novel approach that enhances the reward model's generalization under distribution shift by regularizing the hidden states. Specifically, we retain the base model's language model head and incorporate a suite of text-generation losses to preserve the hidden states' text-generation capabilities, while concurrently learning a reward head on top of the same hidden states.
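The following is a minimal sketch of the idea described in the abstract, not the authors' implementation: a shared backbone whose hidden states feed both the retained language-model head and a new scalar reward head, trained with a pairwise preference loss plus a next-token regularizer. The backbone interface, the specific text-generation loss (cross-entropy on the chosen response), the Bradley-Terry form of the preference loss, and the weight `lm_coef` are illustrative assumptions.

```python
# Sketch only: assumes a backbone module mapping (input_ids, attention_mask)
# to hidden states of shape (batch, seq_len, hidden_size).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegularizedRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, lm_head: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone          # shared transformer producing hidden states
        self.lm_head = lm_head            # retained language-model head of the base model
        self.reward_head = nn.Linear(hidden_size, 1)  # new scalar reward head

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)    # (batch, seq, hidden)
        lm_logits = self.lm_head(hidden)                     # text-generation logits
        last_idx = attention_mask.sum(dim=1) - 1             # last non-padding position
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        reward = self.reward_head(last_hidden).squeeze(-1)   # scalar reward per sequence
        return reward, lm_logits


def training_loss(model, chosen, rejected, lm_coef=0.1):
    """Pairwise preference loss plus a text-generation regularizer
    that keeps the shared hidden states usable by the LM head."""
    r_chosen, lm_logits = model(chosen["input_ids"], chosen["attention_mask"])
    r_rejected, _ = model(rejected["input_ids"], rejected["attention_mask"])

    # Standard Bradley-Terry preference loss: chosen should outscore rejected.
    reward_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Next-token prediction on the chosen response regularizes the hidden states.
    shift_logits = lm_logits[:, :-1, :]
    shift_labels = chosen["input_ids"][:, 1:]
    lm_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
    return reward_loss + lm_coef * lm_loss
```

In this sketch the reward head and the LM head share every parameter of the backbone, so the text-generation loss acts directly as a regularizer on the representations the reward is computed from.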
Neural Information Processing Systems
Mar-22-2025, 13:00:49 GMT