Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

Rafael Rafailov, Stanford University

Neural Information Processing Systems 

A prominent issue with such alignment methods is reward over-optimization, or reward hacking, where performance as measured by the learned proxy reward model increases while true quality plateaus or even deteriorates.
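A minimal toy sketch (not from the paper) can make the over-optimization pattern concrete: proxy reward rises monotonically with optimization pressure, while the true ("gold") reward peaks and then deteriorates. The functional forms and constants below are illustrative assumptions, loosely following the d(α − βd) shape used in prior reward-model scaling-law work, with d standing for the square root of the KL divergence from the initial policy.

```python
import numpy as np

# Illustrative constants (assumed, not from the paper).
alpha, beta = 1.0, 0.25

# d = sqrt(KL from the initial policy), a proxy for optimization pressure.
d = np.linspace(0.0, 8.0, 200)

proxy_reward = alpha * d               # proxy reward keeps increasing
gold_reward = d * (alpha - beta * d)   # gold reward peaks, then declines

peak = d[np.argmax(gold_reward)]
print(f"gold reward peaks near d = {peak:.2f}, then deteriorates")
print(f"at the end: proxy = {proxy_reward[-1]:.2f}, gold = {gold_reward[-1]:.2f}")
```

Under these assumed forms the gold reward peaks at d = α/(2β) and turns negative beyond d = α/β, while the proxy never signals the decline, which is the hallmark of reward hacking.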
