Safe RLHF: Safe Reinforcement Learning from Human Feedback

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang

arXiv.org Artificial Intelligence 

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, which avoids confusing crowdworkers about the tension between the two objectives and allows us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task: maximizing the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through three rounds of fine-tuning with Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance, compared to existing value-alignment algorithms. Experimentally, we fine-tuned Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations. Warning: This paper contains example data that may be offensive or harmful.

Large Language Models (LLMs) have shown remarkable capabilities in understanding instructions (Chung et al., 2022; Ouyang et al., 2022), summarization (Stiennon et al., 2020; Koh et al., 2022), complex reasoning (OpenAI, 2023; Anil et al., 2023), and more. Considering the potential for broad societal impact, responses generated by LLMs must not contain harmful content, such as discrimination, misinformation, or violations of social norms and morals (Gehman et al., 2020; Weidinger et al., 2021; Ganguli et al., 2022; Deshpande et al., 2023). The alignment of LLMs with safety requirements has therefore received widespread attention from academia and industry (Christian, 2023). An essential component of safety alignment is minimizing a model's tendency to generate harmful responses through fine-tuning.

[Figure 1: The Safe RLHF pipeline compared to the conventional RLHF method. NOTE: In the annotation phase, the safety labels for the responses are annotated independently; a pair of responses can be labeled as both safe or both unsafe.]

RLHF leverages LLMs' broad knowledge and capabilities to promote desired responses and behaviors, leading to safer, higher-performing, and more controllable AI systems.
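For concreteness, the constrained objective described in the abstract can be written out as follows. This is a sketch reconstructed from the prose above, not the paper's exact notation: the prompt distribution D, policy pi_theta, reward model R, cost model C, and the zero cost threshold are standard safe-RL conventions assumed here.

```latex
% Maximize expected reward subject to an expected-cost constraint.
\max_{\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[ R(x, y) \big]
\quad \text{s.t.} \quad
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[ C(x, y) \big] \le 0

% Lagrangian relaxation: an unconstrained min-max problem over the policy
% parameters \theta and a multiplier \lambda \ge 0.
\min_{\lambda \ge 0} \, \max_{\theta} \;
  \mathbb{E}\big[ R(x, y) \big] - \lambda \, \mathbb{E}\big[ C(x, y) \big]
```

Alternating gradient steps, ascent on theta and ascent on lambda along the constraint violation, let the multiplier grow while the policy is unsafe and shrink once the constraint is slack; this is the dynamic rebalancing between helpfulness and harmlessness that the abstract refers to.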

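Below is a minimal PyTorch-style illustration of that alternating Lagrangian update, assuming a PPO-style trainer has already produced reward and cost advantages. Every name here (lagrangian_advantage, update_lambda, the (1 + lambda) normalization, the learning rate) is an illustrative assumption, not the authors' released implementation.

```python
import torch

# Log-parameterized Lagrange multiplier: lambda = exp(log_lambda) stays positive.
# The initial value lambda = 1 is an arbitrary choice for this sketch.
log_lambda = torch.zeros((), requires_grad=True)
lambda_opt = torch.optim.Adam([log_lambda], lr=1e-2)


def lagrangian_advantage(adv_reward: torch.Tensor, adv_cost: torch.Tensor) -> torch.Tensor:
    """Combine reward and cost advantages into a single PPO advantage.

    The policy maximizes R - lambda * C; dividing by (1 + lambda) is a
    common stabilization that keeps the advantage scale roughly constant
    as lambda changes (an assumption of this sketch, not a quoted detail).
    """
    lam = log_lambda.exp().detach()  # the policy step must not backprop into lambda
    return (adv_reward - lam * adv_cost) / (1.0 + lam)


def update_lambda(mean_batch_cost: float, threshold: float = 0.0) -> None:
    """Dual ascent on lambda * (J_C - d): raise lambda when the measured
    cost exceeds the threshold, lower it when the constraint is slack."""
    lam = log_lambda.exp()
    loss = -lam * (mean_batch_cost - threshold)  # descent on -(.) == ascent on (.)
    lambda_opt.zero_grad()
    loss.backward()
    lambda_opt.step()


# Usage inside a hypothetical RLHF training loop:
#   adv = lagrangian_advantage(adv_reward, adv_cost)  # feed into the PPO clip loss
#   update_lambda(cost_scores.mean().item())          # one dual step per batch
```

Because lambda is updated from the measured cost rather than fixed in advance, the trade-off between the two objectives shifts over training, matching the abstract's claim that Safe RLHF dynamically adjusts the balance during fine-tuning.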