RRHF: Rank Responses to Align Language Models with Human Feedback without tears

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang

arXiv.org, Artificial Intelligence

InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). However, PPO is sensitive to hyperparameters and requires multiple models in its standard implementation, making it hard to train and to scale up to larger parameter counts. In contrast, we propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via the logarithm of conditional probabilities and learns to align these probabilities with human preferences through a ranking loss. RRHF can leverage sampled responses from various sources, including the model's own responses, responses from other large language models, and human expert responses, and learns to rank them. RRHF needs only 1 to 2 models during tuning and can efficiently and robustly align language models with human preferences without complex hyperparameter tuning. Additionally, RRHF can be considered an extension of SFT and reward model training, while being simpler than PPO in terms of coding, model counts, and hyperparameters. We evaluate RRHF on the Helpful and Harmless dataset, demonstrating alignment performance comparable to PPO as measured by reward model score and human labeling. Extensive experiments show that the performance of RRHF is closely tied to sampling quality, which suggests that RRHF is a best-of-n learner.
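
The core of the training objective described above is a pairwise ranking loss over length-normalized log-probabilities of the sampled responses. The following is a minimal sketch of that ranking term, assuming PyTorch; the function name, tensor shapes, and example values are illustrative, and the SFT-style cross-entropy term on the best-ranked response that RRHF also uses is omitted here.

```python
import torch

def rrhf_ranking_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Ranking loss sketch: penalize pairs where a lower-reward response
    receives a higher model score than a higher-reward response.

    logprobs: (n,) length-normalized sums of token log-probabilities,
              one score per sampled response.
    rewards:  (n,) preference scores (e.g., from a reward model or human labels).
    """
    # Pairwise differences: entry [i, j] compares response i against response j.
    score_diff = logprobs.unsqueeze(1) - logprobs.unsqueeze(0)   # p_i - p_j
    reward_diff = rewards.unsqueeze(1) - rewards.unsqueeze(0)    # r_i - r_j
    # Only pairs where response i is ranked below response j (r_i < r_j) contribute.
    mask = (reward_diff < 0).float()
    return (torch.clamp(score_diff, min=0) * mask).sum()

# Example: three sampled responses with model scores and reward scores (made up).
logprobs = torch.tensor([-1.2, -0.8, -2.0], requires_grad=True)
rewards = torch.tensor([0.1, 0.9, 0.4])
loss = rrhf_ranking_loss(logprobs, rewards)
loss.backward()
```

Because the loss only compares scores of responses against each other, it avoids the value networks and on-policy rollouts of PPO, which is why only 1 to 2 models are needed during tuning.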
