Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

Open in new window