SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
Teng Xiao, Yige Yuan, Zhengyu Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, Vasant G. Honavar
arXiv.org Artificial Intelligence
Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models. In this paper, we propose a simple yet effective hyperparameter-free preference optimization algorithm for alignment. We observe that promising performance can be achieved simply by optimizing inverse perplexity, i.e., the exponentiated average log-likelihood, of the chosen and rejected responses in the preference dataset. The resulting simple learning objective, SimPER (Simple alignment with Perplexity optimization), is easy to implement and eliminates the need for expensive hyperparameter tuning and a reference model, making it both computationally and memory efficient. Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches, even without any hyperparameters or a reference model. For example, despite its simplicity, SimPER outperforms state-of-the-art methods by up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking across the 10 benchmarks of the Open LLM Leaderboard.

Learning from preference data plays a crucial role in fine-tuning large language models, ensuring that pretrained LLMs are aligned with human or societal values and preferences (Bai et al., 2022; Ouyang et al., 2022; Stiennon et al., 2020). In recent years, reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022; Christiano et al., 2017) has been proposed for fine-tuning language models based on human preferences. In the RLHF pipeline (Ouyang et al., 2022), a reward model is first fit to a dataset of human preferences as a classifier that distinguishes chosen from rejected responses. An LLM policy is then trained with RL algorithms such as proximal policy optimization (PPO) (Schulman et al., 2017) to generate high-reward responses to input prompts. While RLHF produces models with impressive capabilities across diverse tasks, ranging from programming to creative writing, it introduces notable complexity into the training process (Engstrom et al., 2020; Rafailov et al., 2024): optimization is inefficient and unstable, and separate reward and policy models must be trained.
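To make the objective described in the abstract concrete, the following is a minimal PyTorch-style sketch of a SimPER-like loss, under the assumption that the method raises the inverse perplexity (exponentiated average log-likelihood) of chosen responses while lowering it for rejected ones. The function name, tensor shapes, and the exact way the two terms are combined are illustrative assumptions based on the abstract's description, not the paper's released implementation.

```python
import torch

def simper_loss(logp_chosen, logp_rejected, mask_chosen, mask_rejected):
    """Sketch of a SimPER-style objective on a batch of preference pairs.

    logp_*: (batch, seq_len) per-token log-probabilities of the response
            tokens under the policy being fine-tuned.
    mask_*: (batch, seq_len) float mask, 1.0 on response tokens and 0.0 on
            prompt/padding tokens.
    """
    # Length-normalized (average) log-likelihood of each response.
    avg_ll_chosen = (logp_chosen * mask_chosen).sum(-1) / mask_chosen.sum(-1)
    avg_ll_rejected = (logp_rejected * mask_rejected).sum(-1) / mask_rejected.sum(-1)

    # Inverse perplexity = exp(average log-likelihood), which lies in (0, 1].
    inv_ppl_chosen = torch.exp(avg_ll_chosen)
    inv_ppl_rejected = torch.exp(avg_ll_rejected)

    # Increase inverse perplexity of chosen responses and decrease it for
    # rejected ones; no reference model and no tunable hyperparameter appear.
    return (inv_ppl_rejected - inv_ppl_chosen).mean()
```

Note that, consistent with the properties emphasized in the abstract, this sketch involves no reference policy and no temperature- or margin-like hyperparameter; the only normalization is the length averaging inside the perplexity computation.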
Feb-17-2025