Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
