Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Neural Information Processing Systems 

When reward error is light-tailed, regularizing the policy with KL divergence from a base model can mitigate misspecification. However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model, a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error.
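To make the light- vs heavy-tailed contrast concrete, here is a minimal simulation sketch. It is not the paper's method (which analyzes KL-regularized policies): it assumes a proxy reward equal to true utility plus independent error and optimizes by best-of-n selection, with all names and parameters below purely illustrative. With Gaussian (light-tailed) error, the sample selected for highest proxy reward also has high true utility; with Student-t (heavy-tailed) error, the top proxy scores are dominated by error, so selected utility stays near the base level.

```python
import numpy as np

rng = np.random.default_rng(0)

def selected_utility(error_sampler, n=100_000, trials=100):
    """Mean true utility of the best-of-n sample chosen by proxy reward."""
    total = 0.0
    for _ in range(trials):
        utility = rng.normal(size=n)        # true utility of each sample
        proxy = utility + error_sampler(n)  # misspecified reward = utility + error
        total += utility[np.argmax(proxy)]  # utility of the sample with highest proxy
    return total / trials

light_error = lambda n: rng.normal(size=n)         # light-tailed (Gaussian) error
heavy_error = lambda n: rng.standard_t(2, size=n)  # heavy-tailed (Student-t, df=2) error

print(f"light-tailed error: selected utility ~ {selected_utility(light_error):.2f}")  # well above 0
print(f"heavy-tailed error: selected utility ~ {selected_utility(heavy_error):.2f}")  # near 0 (base level)
```

Under these assumptions, optimizing the proxy harder (larger n) keeps improving utility in the light-tailed case, while in the heavy-tailed case the selected utility stays roughly at the unoptimized baseline.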
