DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models

Ruizhe Chen, Wenhao Chai, Zhifei Yang, Xiaotian Zhang, Joey Tianyi Zhou, Tony Quek, Soujanya Poria, Zuozhu Liu

arXiv.org Artificial Intelligence 

The alignment of large language models (LLMs) with human preferences has recently emerged as a focal area of research [53, 62]. Prominent techniques such as Reinforcement Learning from Human Feedback (RLHF) [47] and Direct Preference Optimization (DPO) [50] have demonstrated substantial efficacy. However, these methods require optimizing an individual policy for each model to be aligned, which incurs substantial training costs. Inference-time alignment [27, 45] offers an efficient alternative: it directly adjusts the model's output distribution, avoiding resource-intensive retraining. Despite these advantages, the approach still depends on policy-specific value functions, which limits its scalability across different models; moreover, its inference-time latency remains high, posing further obstacles to practical deployment.

In this paper, we investigate an efficient and policy-agnostic preference optimization method. We begin by reconsidering the objective of aligning LLMs with human preferences [53, 65]. As illustrated in Figure 1(a), alignment operates at the sentence level, adjusting key components of the generated content, such as style or format, so that it better reflects human intentions and values.
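For concreteness, "optimizing an individual policy" can be read through the standard DPO objective [50]; the formulation below is reproduced from the DPO literature for context and is not specific to this work:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $(x, y_w, y_l)$ denotes a prompt with its preferred and dispreferred responses, $\sigma$ is the logistic function, and $\beta$ controls deviation from the reference policy $\pi_{\mathrm{ref}}$. Note that $\pi_\theta$ must be retrained for every model one wishes to align, which is the training cost the paragraph above refers to.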
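To make the inference-time alternative concrete, the following is a minimal sketch of value-guided decoding, in which a frozen base policy's next-token distribution is tilted by $\exp(\beta\, v(\cdot))$. Everything here (the toy vocabulary, the hand-written value function, the sampling loop) is an illustrative assumption, not the DiffPO method itself:

```python
import numpy as np

# Minimal sketch of value-guided inference-time alignment (illustrative only):
# a frozen base policy's next-token distribution is reweighted by a
# value function v(context, token) that scores candidate continuations.
# The vocabulary, distributions, and value function below are toy
# assumptions, not the method proposed in this paper.

rng = np.random.default_rng(0)
VOCAB = ["helpful", "neutral", "harmful", "<eos>"]

def base_logits(context: list[str]) -> np.ndarray:
    """Stand-in for a frozen LLM's next-token logits (random here)."""
    return rng.normal(size=len(VOCAB))

def value(context: list[str], token: str) -> float:
    """Toy policy-specific value function: rewards 'aligned' tokens."""
    return {"helpful": 2.0, "neutral": 0.5, "harmful": -3.0, "<eos>": 0.0}[token]

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def guided_step(context: list[str], beta: float = 1.0) -> str:
    """Sample one token from the value-adjusted output distribution:
    p(token) proportional to p_base(token) * exp(beta * v(context, token))."""
    logits = base_logits(context)
    adjusted = logits + beta * np.array([value(context, t) for t in VOCAB])
    return rng.choice(VOCAB, p=softmax(adjusted))

context: list[str] = []
for _ in range(8):
    tok = guided_step(context)
    if tok == "<eos>":
        break
    context.append(tok)
print(" ".join(context))
```

The sketch also makes the two stated drawbacks visible: `value` is queried for every candidate token at every decoding step (the latency overhead), and it is trained against a specific policy (the obstacle to reuse across models).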