Mitigating Preference Hacking in Policy Optimization with Pessimism
