β-DPO: Direct Preference Optimization with Dynamic β
Junkang Wu¹, Zhengyi Yang¹, Jiancan Wu¹
Neural Information Processing Systems
Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning of its trade-off parameter β, as well as to the quality of the preference data. We analyze the impact of β and data quality on DPO, uncovering that optimal β values vary with the informativeness of pairwise data. Addressing the limitations of static β values, we introduce a novel framework that dynamically calibrates β at the batch level, informed by data quality considerations. Additionally, our method incorporates β-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic β adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback.
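To make the idea concrete, below is a minimal sketch of a DPO loss whose β is calibrated per batch from the implicit reward margins and which masks outlier pairs before averaging. The specific choices here (the scaling rule `beta0 * (1 + alpha * (mean_margin - margin_ref))`, the 2-sigma outlier mask, and the parameter names `beta0`, `alpha`, `margin_ref`) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dynamic_beta_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                          ref_chosen_logps, ref_rejected_logps,
                          beta0=0.1, alpha=0.5, margin_ref=0.0):
    """DPO loss with a batch-level beta informed by the batch's reward margins (sketch)."""
    # Implicit reward margins of chosen over rejected responses,
    # measured against the frozen reference model.
    margins = (policy_chosen_logps - ref_chosen_logps) - \
              (policy_rejected_logps - ref_rejected_logps)

    # beta-guided filtering (illustrative): mask pairs whose margin deviates
    # strongly from the batch mean, so outliers do not dominate the update.
    mean, std = margins.mean(), margins.std().clamp_min(1e-6)
    keep = ((margins - mean).abs() <= 2.0 * std).float()

    # Batch-level dynamic beta: scale the base beta by how informative this
    # batch is relative to a reference margin (margin_ref is an assumed constant).
    beta = (beta0 * (1.0 + alpha * (mean.detach() - margin_ref))).clamp_min(1e-3)

    # Standard DPO objective, applied only to the retained pairs.
    losses = -F.logsigmoid(beta * margins) * keep
    return losses.sum() / keep.sum().clamp_min(1.0), beta
```

Masking rather than resampling keeps the batch size fixed and is only one way to realize the filtering idea; the paper's actual criterion may differ.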