Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling
