Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Neural Information Processing Systems 

RLHF then fine-tunes the LLM to maximize the learned reward using RL techniques.
