Learning Optimal Advantage from Preferences and Mistaking it for Reward
