Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons
