Optimization
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model, a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error.
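For orientation, the KL-regularized objective the title refers to is commonly written as below. This is the standard textbook form rather than a formula quoted from the paper. The symbols are assumptions: $\pi_0$ for the base model, $\beta$ for the penalty weight, and the split of the reward $r$ into utility $U$ plus error $\varepsilon$.

```latex
\max_{\pi} \;\; \mathbb{E}_{x \sim \pi}\left[ r(x) \right] \;-\; \beta \, D_{\mathrm{KL}}\left( \pi \,\|\, \pi_0 \right),
\qquad r(x) = U(x) + \varepsilon(x)
```

In this notation, the excerpt's claim is that when $\varepsilon$ is heavy-tailed, a policy can make $\mathbb{E}_{\pi}[r]$ arbitrarily large while $\mathbb{E}_{\pi}[U]$ stays no higher than under $\pi_0$.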
TreeVI: Reparameterizable Tree-structured Variational Inference for Instance-level Correlation Capturing
Mean-field variational inference (VI) is computationally scalable, but its highly demanding independence requirement hinders it from being applied to wider scenarios. Although many VI methods that take correlation into account have been proposed, these methods generally are not scalable enough to capture the correlation among data instances, which often arises in applications involving graphs or explicit constraints among instances.
5631e6ee59a4175cd06c305840562ff3-Paper.pdf
At each time step of the episode, the learner observes the current state of the environment, chooses one of the K available actions, and earns a reward. Consequently, the state of the environment changes according to the transition function of the underlying MDP, as a function of the previous state and the action taken by the learner.
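The interaction protocol described above can be sketched as a short loop. This is a minimal illustration, not the paper's algorithm. Every name in it (`run_episode`, `toy_transition`, `toy_reward`, `uniform_learner`) is hypothetical, and the deterministic toy transition and reward functions are assumptions chosen only to keep the example small.

```python
import random

K = 3            # number of available actions (assumed value)
NUM_STATES = 4   # size of the toy state space (assumed value)


def run_episode(transition, reward, choose_action, initial_state, horizon):
    """Run one episode and return a list of (state, action, reward) steps."""
    state = initial_state
    trajectory = []
    for _ in range(horizon):
        action = choose_action(state)      # learner observes the state, picks an action
        r = reward(state, action)          # learner earns a reward
        trajectory.append((state, action, r))
        state = transition(state, action)  # next state depends on previous state and action
    return trajectory


def toy_transition(state, action):
    # Hypothetical deterministic transition; a real MDP may be stochastic.
    return (state + action + 1) % NUM_STATES


def toy_reward(state, action):
    # Hypothetical reward: 1 when the action index matches the state index.
    return 1.0 if state == action else 0.0


_rng = random.Random(0)


def uniform_learner(state):
    # Placeholder learner that ignores the state and picks uniformly among K actions.
    return _rng.randrange(K)


if __name__ == "__main__":
    for s, a, r in run_episode(toy_transition, toy_reward, uniform_learner, 0, 5):
        print(f"state={s} action={a} reward={r}")
```

Passing the learner in as a function keeps the loop independent of the learning rule, so a state-dependent or adaptive policy can replace `uniform_learner` without touching `run_episode`.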