Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator