Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
