Reward Model Overoptimisation in Iterated RLHF
