Preference-based Reinforcement Learning with Finite-Time Guarantees

Neural Information Processing Systems 

We first show that, in preference-based reinforcement learning (PbRL), a unique optimal policy may not exist when preferences over trajectories are deterministic.
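One way such non-existence can arise is through a preference cycle. The sketch below is an illustrative assumption, not the paper's construction: it takes three hypothetical trajectories (`tau_1`, `tau_2`, `tau_3`), a hypothetical deterministic relation in which each one is always preferred to the next, and treats "optimal" as "preferred to every alternative". The helper `undominated` is likewise introduced only for illustration.

```python
# Hypothetical trajectories and deterministic pairwise preferences.
# A pair (a, b) in a relation means "a is always preferred to b".
TRAJECTORIES = ["tau_1", "tau_2", "tau_3"]

# Assumed cyclic relation: tau_1 > tau_2 > tau_3 > tau_1.
CYCLIC = {("tau_1", "tau_2"), ("tau_2", "tau_3"), ("tau_3", "tau_1")}

# Assumed acyclic relation for contrast: tau_1 > tau_2 > tau_3, tau_1 > tau_3.
ACYCLIC = {("tau_1", "tau_2"), ("tau_2", "tau_3"), ("tau_1", "tau_3")}


def undominated(items, beats):
    """Return the items that are preferred to every other item."""
    return [a for a in items if all((a, b) in beats for b in items if b != a)]


if __name__ == "__main__":
    # Under the cycle, every trajectory loses to some other trajectory.
    print("cyclic:", undominated(TRAJECTORIES, CYCLIC))
    # Without the cycle, a single best trajectory exists.
    print("acyclic:", undominated(TRAJECTORIES, ACYCLIC))
```

Under the cyclic relation, a policy that commits to any single trajectory is beaten by a policy that commits to another, so no policy is preferred to all alternatives in this toy setting.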