Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems