Optimization
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model-a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error.
Functional Bilevel Optimization for Machine Learning
Bilevel optimization methods solve problems with hierarchical structures, optimizing two interdependent objectives: an inner-level objective and an outer-level one. Initially used in machine learning for model selection [Bennett et al., 2006] and sparse feature learning [Mairal et al., 2012], these methods gained popularity as efficient alternatives to grid search for hyper-parameter tuning [Feurer and Hutter,