Negative-Prompt-driven Alignment for Generative Language Model
Shiqi Qiao, Ning Xv, Biao Liu, Xin Geng
Their vast parameters (Kaplan et al., 2020) and extensive training data grant them strong capabilities, but they may still generate outputs that conflict with human values, such as unhelpful or harmful content. AI alignment research has therefore emerged with the goal of fine-tuning LLMs so that they align with human values. One of the most popular alignment methods is the RLHF (Reinforcement Learning from Human Feedback) framework (Stiennon et al., 2020; Ziegler et al., 2019; Ouyang et al., 2022), which first applies supervised fine-tuning to the base model so that it follows human instructions. A reward model is then trained on human preference data, and the LLM is optimized with the PPO algorithm (Schulman et al., 2017) to align with human preferences. RLHF requires at least three large models during training, making the process quite complex, and the PPO algorithm itself is sophisticated and difficult to tune. This has driven researchers to explore simpler and more direct methods for aligning language models with human preferences. To simplify alignment, Rafailov et al. (2023) introduced Direct Preference Optimization (DPO), which provides a closed-form alignment solution and uses human preference data directly, without a separate reward model. Other approaches, such as RRHF (Yuan et al., 2023a) and PRO (Song et al., 2024), use SFT-like losses over multi-ranking datasets to provide richer supervision for alignment.
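To make the DPO idea above concrete, the following is a minimal PyTorch sketch of the standard DPO objective from Rafailov et al. (2023), not an implementation from this paper; the function name, argument names, and the beta value are illustrative assumptions.

```python
# Minimal sketch of the DPO loss (Rafailov et al., 2023).
# Inputs are summed log-probabilities of the chosen/rejected responses
# under the trained policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x)) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that pushes the chosen response above the rejected one,
    # using preference pairs directly instead of a separate reward model.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In contrast to PPO-based RLHF, this loss needs only the policy and a frozen reference model, which is what makes DPO a simpler alignment recipe.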
arXiv.org Artificial Intelligence
Oct-15-2024