Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
