One Goal, Many Challenges: Robust Preference Optimization Amid Content-Aware and Multi-Source Noise

Amirabbas Afzali, Amirhossein Afsharrad, Seyed Shahabeddin Mousavi, Sanjay Lall

arXiv.org Artificial Intelligence 

Large Language Models (LLMs) have made significant strides in generating human-like responses, largely due to preference alignment techniques. However, these methods often assume unbiased human feedback, which is rarely the case in real-world scenarios. This paper introduces Content-Aware Noise-Resilient Preference Optimization (CNRPO), a novel framework that addresses multiple sources of content-dependent noise in preference learning. CNRPO employs a multi-objective optimization approach to separate true preferences from content-aware noises, effectively mitigating their impact. We leverage backdoor attack mechanisms to efficiently learn and control various noise sources within a single model. Theoretical analysis and extensive experiments on different synthetic noisy datasets demonstrate that CNRPO significantly improves alignment with primary human preferences while controlling for secondary noises and biases, such as response length and harmfulness.

Recent advancements in Large Language Models (LLMs) have significantly enhanced their capabilities through preference alignment techniques, primarily Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2023; Stiennon et al., 2022; Ouyang et al., 2022). However, RLHF faces challenges such as reward model misgeneralization and training instability (Touvron et al., 2023; Casper et al., 2023). To address these issues, ranking-based methods such as Direct Preference Optimization (DPO) (Rafailov et al., 2024) and Identity Preference Optimization (IPO) (Azar et al., 2023) have been developed, bypassing explicit reward modeling. While these approaches have advanced LLM capabilities, they assume unbiased human feedback. In reality, annotations can be influenced by various biases, such as a preference for longer responses or a focus on safety (Park et al., 2024b; Wang et al., 2024).
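For context on the ranking-based objectives mentioned above, the sketch below shows the standard DPO loss, which scores a preference pair directly from policy and reference-model log-probabilities instead of training an explicit reward model. This is a minimal PyTorch illustration of the published DPO objective only; the function and argument names are illustrative, and CNRPO's multi-objective, noise-aware extension is not reproduced here.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the policy to the frozen reference model for the
    # preferred (chosen) and dispreferred (rejected) responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin; beta controls deviation from the reference.
    margin = beta * (chosen_logratios - rejected_logratios)
    # Maximize the log-likelihood that the chosen response is preferred.
    return -F.logsigmoid(margin).mean()

Here each argument is the summed log-probability of a full response under the corresponding model, so the loss requires no separately trained reward network.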