UltraFeedback: Boosting Language Models with High-quality Feedback

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, Maosong Sun

arXiv.org Artificial Intelligence 

Reinforcement learning from human feedback (RLHF) has become a pivotal technique in aligning large language models (LLMs) with human preferences. In RLHF practice, preference data plays a crucial role in bridging human proclivities and LLMs. However, the scarcity of diverse, naturalistic datasets of human preferences on LLM outputs at scale poses a great challenge to RLHF as well as to feedback-learning research within the open-source community. Current preference datasets, either proprietary or limited in size and prompt variety, result in limited RLHF adoption in open-source models and hinder further exploration. To address this, we present UltraFeedback, a large-scale AI feedback dataset: we meticulously devise annotation instructions and employ GPT-4 to offer detailed feedback in both numerical and textual forms. Experimental results indicate that our models outperform existing open-source models, achieving top performance across multiple benchmarks.

Large language models (LLMs), represented by ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023), have demonstrated proficiency in generating fluent text as well as in solving various language-oriented tasks. Trained on massive corpora through likelihood maximization, these LLMs have exhibited remarkable generalization and acquired the ability to execute diverse tasks in response to user directives (Ouyang et al., 2022; Wei et al., 2022; Sanh et al., 2022). Unfortunately, relying solely on likelihood maximization during training leads to well-known issues: LLMs may generate convincing but incorrect or unsafe content that deviates from human preferences (Stiennon et al., 2020; Ouyang et al., 2022; Perez et al., 2022).

To further align LLMs with human preferences, reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022; Askell et al., 2021; Bai et al., 2022a; Touvron et al., 2023b) has been introduced and widely adopted by leading corporations. RLHF builds upon preference data, which rates and compares different responses to the same prompt. Typically, RLHF trains a reward model on preference data and then applies RL algorithms such as Proximal Policy Optimization (PPO) (Schulman et al., 2017) to the LLM to optimize the rewards (OpenAI, 2022; 2023; Touvron et al., 2023b; Bai et al., 2022a). While proprietary models have largely capitalized on RLHF's potential to produce outputs that are both more useful and safer, a significant gap persists in the open-source community. As a result, RLHF yields only marginal gains for open-source models and is rarely adopted, which critically hinders successful RLHF practice and further research.
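As context for the reward-modeling step described above, the following minimal PyTorch sketch shows the standard pairwise (Bradley-Terry) objective commonly used to fit a reward model on preference pairs. It is an illustrative assumption, not the paper's implementation: the `RewardModel` class, hidden size, and random tensors are hypothetical stand-ins.

```python
# Minimal sketch of pairwise reward-model training on preference data.
# Hypothetical names and shapes; not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy scalar reward head on top of pooled prompt+response embeddings."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_repr: torch.Tensor) -> torch.Tensor:
        # pooled_repr: (batch, hidden_dim) embedding of a prompt + response
        return self.scorer(pooled_repr).squeeze(-1)  # (batch,) scalar rewards

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the log-probability that the preferred response scores higher:
    # loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Usage with random stand-in features for a batch of 4 preference pairs.
model = RewardModel()
chosen_repr, rejected_repr = torch.randn(4, 768), torch.randn(4, 768)
loss = pairwise_preference_loss(model(chosen_repr), model(rejected_repr))
loss.backward()
```

The trained reward model then scores sampled responses during the PPO stage, where the LLM is updated to increase expected reward (typically with a KL penalty toward the initial policy).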
