Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems
