Mitigating Preference Hacking in Policy Optimization with Pessimism