Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment

Mingzhi Wang, Chengdong Ma, Qizhi Chen, Linjian Meng, Yang Han, Jiancong Xiao, Zhaowei Zhang, Jing Huo, Weijie J. Su, Yaodong Yang

Dec-20-2024, arXiv.org Artificial Intelligence

Self-play methods have demonstrated remarkable success in enhancing model capabilities across various domains. In the context of Reinforcement Learning from Human Feedback (RLHF), self-play not only boosts Large Language Model (LLM) performance but also overcomes the limitations of traditional Bradley-Terry (BT) model assumptions by finding the Nash equilibrium (NE) of a preference-based, two-player constant-sum game. However, existing methods either guarantee only average-iterate convergence, incurring high storage and inference costs, or converge to the NE of a regularized game, failing to accurately reflect true human preferences. In this paper, we introduce Magnetic Preference Optimization (MPO), a novel approach capable of achieving last-iterate convergence to the NE of the original game, effectively overcoming the limitations of existing methods. Building upon Magnetic Mirror Descent (MMD), MPO attains a linear convergence rate, making it particularly suitable for fine-tuning LLMs. To ensure our algorithm is both theoretically sound and practically viable, we present a simple yet effective implementation that adapts the theoretical insights to the RLHF setting. Empirical results demonstrate that MPO can significantly enhance the performance of LLMs, highlighting the potential of self-play methods in alignment.
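To make the game-theoretic framing concrete, here is a minimal tabular sketch of self-play with a magnetic mirror descent style update on a toy preference game. It is not the authors' implementation. The 3x3 preference matrix `PREF`, the step size `eta`, the regularization strength `alpha`, and the schedule that periodically resets the magnet to the current policy are all assumptions chosen for illustration. The matrix is deliberately cyclic (response 0 beats 1, 1 beats 2, 2 beats 0), which a Bradley-Terry model cannot represent; its Nash equilibrium is the uniform policy.

```python
import numpy as np

def mmd_step(policy, magnet, payoff, eta, alpha):
    """One entropic mirror-ascent step with a KL pull toward `magnet`.

    Closed form on the simplex:
    new ∝ (policy * magnet**(eta*alpha) * exp(eta*payoff)) ** (1 / (1 + eta*alpha))
    """
    log_new = (np.log(policy) + eta * alpha * np.log(magnet) + eta * payoff)
    log_new /= 1.0 + eta * alpha
    log_new -= log_new.max()          # numerical stability before exponentiating
    new = np.exp(log_new)
    return new / new.sum()

def self_play(pref, policy, eta=0.5, alpha=0.2, outer=50, inner=100):
    """Symmetric self-play: the policy plays against a copy of itself.

    Assumption (illustrative): after `inner` regularized steps the magnet is
    reset to the current policy, so the regularized target drifts toward the
    equilibrium of the unregularized game.
    """
    magnet = policy.copy()
    for _ in range(outer):
        for _ in range(inner):
            win_rate = pref @ policy  # win probability of each response vs. current policy
            policy = mmd_step(policy, magnet, win_rate, eta, alpha)
        magnet = policy.copy()        # move the magnet to the last iterate
    return policy

# Hypothetical cyclic preferences: PREF[i, j] = P(response i preferred over j).
PREF = np.array([[0.5, 0.9, 0.1],
                 [0.1, 0.5, 0.9],
                 [0.9, 0.1, 0.5]])

if __name__ == "__main__":
    start = np.array([0.8, 0.15, 0.05])
    final = self_play(PREF, start)
    print(np.round(final, 4))         # last iterate, no averaging over past policies
```

In this toy setting only the final policy is returned, so no history of iterates has to be stored or averaged. With a fixed magnet the iterates would instead settle at the equilibrium of the regularized game, which generally differs from the uniform policy when the magnet is not uniform.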

Country:
- North America > United States
  - Virginia (0.04)
  - Pennsylvania (0.04)
- Asia > China
  - Jiangsu Province > Nanjing (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Leisure & Entertainment > Games (0.92)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)