Confronting Reward Model Overoptimization with Constrained RLHF
