PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models

Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang

arXiv.org Artificial Intelligence 

In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate the annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels covering 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Building on this, we collected 166.8k preference annotations, comprising dual-preference data (helpfulness and harmlessness annotated separately) and single-preference data (a single judgment trading off helpfulness against harmlessness). Using this large-scale annotation data, we further train severity-sensitive moderation models for the risk control of LLMs and develop safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.
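Since the decoupled (dual-preference) annotation is the dataset's distinguishing feature, a minimal Python sketch of loading and inspecting it may help. This assumes the public Hugging Face release at PKU-Alignment/PKU-SafeRLHF and field names such as better_response_id and safer_response_id; both the hub path and the fields should be verified against the actual dataset card.

from datasets import load_dataset

# Load the training split of the preference dataset (assumed hub path).
ds = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")

example = ds[0]
print(example["prompt"])  # one of the refined prompts (assumed field name)

# Dual-preference annotation: helpfulness and harmlessness are labeled
# separately, so the more helpful response and the safer response may
# differ for the same question-answer pair.
helpful_id = example["better_response_id"]  # assumed field name
safer_id = example["safer_response_id"]     # assumed field name
if helpful_id != safer_id:
    print("Helpfulness and harmlessness preferences disagree on this pair.")

Pairs where the two labels disagree are exactly the cases that motivate decoupling: a single scalar preference would force an implicit trade-off, whereas the dual labels let a reward model and a cost model be trained separately.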
