Jailbreaking as a Reward Misspecification Problem

Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong

Jul-12-2024–arXiv.org Artificial Intelligence

The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric, ReGap, to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against a variety of aligned target LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages of the proposed reward misspecification objective over previous methods.
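The abstract does not spell out how ReGap is computed. As a minimal sketch, and only under stated assumptions, one plausible reading treats the reward as a DPO-style implicit reward (the scaled log-likelihood ratio between the aligned model and a reference model) and measures the gap between the reward of a harmless response and that of a harmful one for the same prompt. The function names `implicit_reward` and `reward_gap`, the `beta` scale, and the sign convention are assumptions made for illustration, not definitions taken from this page.

```python
from typing import Sequence


def implicit_reward(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Assumed DPO-style implicit reward for one response.

    Inputs are per-token log-probabilities of the same response under the
    aligned (policy) model and under a reference model. The reward is the
    scaled log-likelihood ratio: beta * (log pi(y|x) - log pi_ref(y|x)).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both models must score the same tokens")
    return beta * (sum(policy_logprobs) - sum(ref_logprobs))


def reward_gap(
    harmless_policy: Sequence[float],
    harmless_ref: Sequence[float],
    harmful_policy: Sequence[float],
    harmful_ref: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Hypothetical gap: reward(harmless response) - reward(harmful response).

    Under this assumed convention, a gap at or below zero means the implicit
    reward does not prefer the harmless response for this prompt, which is
    the situation described as reward misspecification.
    """
    return implicit_reward(harmless_policy, harmless_ref, beta) - implicit_reward(
        harmful_policy, harmful_ref, beta
    )
```

In this sketch, a prompt whose gap is negative would be flagged as one where the aligned model's implicit reward favors the harmful response. The log-probabilities would come from scoring each response token by token with both models.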


