Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
Neural Information Processing Systems
If reward-model error is heavy-tailed, however, some policies obtain arbitrarily high proxy reward despite achieving no more utility than the base model, a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error.
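As a rough illustration of this claim (a toy sketch, not the paper's code), the snippet below models proxy reward as true utility plus error and approximates KL-constrained optimization with best-of-n sampling, which stays within roughly log(n) nats of the base policy. The distributions and parameters are illustrative assumptions: Gaussian error stands in for light tails, Cauchy for heavy tails.

```python
# Toy simulation of catastrophic Goodhart: reward = utility + error.
# Optimization pressure is modeled as best-of-n sampling (KL from the
# base policy grows roughly like log(n)). Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def mean_utility_of_best(n_samples, n_trials, error_sampler):
    """Average true utility of the sample with the highest proxy reward."""
    utility = rng.normal(size=(n_trials, n_samples))
    reward = utility + error_sampler((n_trials, n_samples))
    best = reward.argmax(axis=1)  # pick the reward-maximizing sample
    return utility[np.arange(n_trials), best].mean()

for n in [10, 100, 1000, 10000]:
    light = mean_utility_of_best(n, 2000, lambda s: rng.normal(size=s))           # light-tailed error
    heavy = mean_utility_of_best(n, 2000, lambda s: rng.standard_cauchy(size=s))  # heavy-tailed error
    print(f"n={n:>6}  light-tailed: {light:+.3f}   heavy-tailed: {heavy:+.3f}")
```

Under light-tailed (Gaussian) error, the selected samples' true utility keeps growing with n; under heavy-tailed (Cauchy) error it stays near zero, since the maximum reward is dominated by error outliers rather than genuine utility.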