Mitigating Reward Overoptimization via Lightweight Uncertainty Estimation

Neural Information Processing Systems

Reinforcement Learning from Human Feedback (RLHF) has been pivotal in aligning Large Language Models with human values but often suffers from overoptimization due to its reliance on a proxy reward model.