SimPO: Simple Preference Optimization with a Reference-Free Reward

Yu Meng

Neural Information Processing Systems 

Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further improving the algorithm's performance.
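As a rough illustration of what adding a target margin to a Bradley-Terry objective can look like, the sketch below computes the negative log-likelihood of preferring the winning response after subtracting a margin from the reward gap. The function name `margin_bt_loss` and the parameters `reward_win`, `reward_lose`, and `gamma` are hypothetical and chosen for this example; how the rewards themselves are computed is not covered by this excerpt, so they are taken here as plain scalars.

```python
import math


def margin_bt_loss(reward_win: float, reward_lose: float, gamma: float = 0.0) -> float:
    """Bradley-Terry negative log-likelihood with a target margin (illustrative sketch).

    Computes -log(sigmoid(reward_win - reward_lose - gamma)).
    All names here are assumptions made for this example.
    """
    z = reward_win - reward_lose - gamma
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch on the sign of z
    # so that exp() is only ever called on a non-positive argument.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Same reward gap, with and without a margin.
print(margin_bt_loss(1.0, 0.0, gamma=0.0))
print(margin_bt_loss(1.0, 0.0, gamma=0.5))
```

With `gamma = 0` this reduces to the standard Bradley-Terry loss. With a positive `gamma`, the loss only drops to log 2 once the winning reward exceeds the losing reward by `gamma`, which is one way to read "encourage a larger margin" in the sentence above.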
