Query-Policy Misalignment in Preference-Based Reinforcement Learning
