Gibbs Sampling from Human Feedback: A Provable KL-constrained Framework for RLHF
