Mitigating Reward Overoptimization via Lightweight Uncertainty Estimation

Neural Information Processing Systems 

Reinforcement Learning from Human Feedback (RLHF) has been pivotal in aligning Large Language Models with human values but often suffers from overoptimization due to its reliance on a proxy reward model.
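One common way to mitigate such overoptimization is to penalize the proxy reward wherever its uncertainty is high. Below is a minimal sketch of ensemble-disagreement penalization, a standard instance of this idea; it is not necessarily the paper's lightweight method, and `reward_models`, `beta`, and the callable interface are illustrative assumptions:

```python
import torch

def uncertainty_penalized_reward(reward_models, inputs, beta=1.0):
    """Score a batch with an ensemble of reward models and discount
    the mean reward by the ensemble's disagreement."""
    # Each reward model maps the (prompt, response) batch to a score tensor.
    rewards = torch.stack([rm(inputs) for rm in reward_models], dim=0)
    mean_reward = rewards.mean(dim=0)   # proxy reward estimate
    uncertainty = rewards.std(dim=0)    # disagreement as an uncertainty proxy
    # Down-weight rewards where the ensemble disagrees, discouraging the
    # policy from exploiting regions the proxy reward model is unsure about.
    return mean_reward - beta * uncertainty
```

The penalized reward can then be plugged into the RLHF objective in place of the raw proxy score, trading a small amount of reward for robustness against reward hacking.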
