Negative-Prompt-driven Alignment for Generative Language Model
Shiqi Qiao, Ning Xv, Biao Liu, Xin Geng
Their vast parameters (Kaplan et al., 2020) and extensive training data grant them strong capabilities, but they may still generate outputs that conflict with human values, such as unhelpful or harmful content. AI alignment research has therefore emerged with the goal of fine-tuning LLMs so that they align with human values. One of the most popular alignment methods is the RLHF (Reinforcement Learning from Human Feedback) framework (Stiennon et al., 2020; Ziegler et al., 2019; Ouyang et al., 2022), which first applies supervised fine-tuning to the base model so that it follows human instructions. A reward model is then trained on human preference data, and the LLM is optimized with the PPO algorithm (Schulman et al., 2017) to align with human preferences. RLHF requires at least three large models during training, making the process quite complex, and the PPO algorithm itself is sophisticated and difficult to tune. This has driven researchers to explore simpler and more direct methods for aligning language models with human preferences. To simplify alignment, Rafailov et al. (2023) introduced Direct Preference Optimization (DPO), which provides a closed-form alignment solution and uses human preference data directly, without a separate reward model. Other approaches, such as RRHF (Yuan et al., 2023a) and PRO (Song et al., 2024), use SFT-like losses over multi-ranking datasets to provide richer supervision for alignment.
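To make the DPO idea above concrete, the following is a minimal PyTorch sketch of the standard DPO objective from Rafailov et al. (2023), not an implementation from this paper; the function name, argument names, and the beta value are illustrative assumptions.

```python
# Minimal sketch of the DPO loss (Rafailov et al., 2023).
# Inputs are summed log-probabilities of the chosen/rejected responses
# under the trained policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x)) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that pushes the chosen response above the rejected one,
    # using preference pairs directly instead of a separate reward model.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In contrast to PPO-based RLHF, this loss needs only the policy and a frozen reference model, which is what makes DPO a simpler alignment recipe.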
arXiv.org Artificial Intelligence
Oct-15-2024