A Theoretical Analysis of Nash Learning from Human Feedback under General KL-Regularized Preference