Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma
– arXiv.org Artificial Intelligence
As large language models (LLMs) grow more powerful, ensuring their safety against misuse becomes crucial. While researchers have focused on developing robust defenses, no method has yet achieved complete invulnerability to attacks. We propose an alternative approach: instead of seeking perfect adversarial robustness, we develop rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks. To study this setting, we develop RapidResponseBench, a benchmark that measures a defense's robustness against various jailbreak strategies after adapting to a few observed examples. We evaluate five rapid response methods, all of which use jailbreak proliferation, where we automatically generate additional jailbreaks similar to the examples observed. Our strongest method, which fine-tunes an input classifier to block proliferated jailbreaks, reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set, having observed just one example of each jailbreaking strategy. Moreover, further studies suggest that the quality of the proliferation model and the number of proliferated examples play a key role in the effectiveness of this defense. Overall, our results highlight the potential of responding rapidly to novel jailbreaks to limit LLM misuse.

As LLMs become more capable, they pose greater misuse risks. Indeed, the potential for catastrophic misuse of LLMs has motivated AI labs to make public commitments to developing safeguards that minimize the risk of such misuse (Anthropic, 2023; OpenAI, 2023). These concerns have also motivated substantial effort from the research community to defend against jailbreaks, techniques that extract harmful information from LLMs trained to be helpful, harmless, and honest (Bai et al., 2022b; Xie et al., 2023; Xu et al., 2024). Despite ongoing research, ensuring that LLMs are robustly resistant to jailbreaking remains an unsolved challenge (Hendrycks et al., 2021; Ziegler et al., 2022). Even state-of-the-art methods that substantially improve robustness, such as representation rerouting (Zou et al., 2024), have been publicly broken within hours of release. The situation could worryingly parallel that of adversarial robustness in computer vision, where new defenses are routinely defeated, with proper tuning, by attacks that existed before the defenses were developed (Tramer et al., 2020). Indeed, in computer vision, a decade of work and thousands of papers have yielded "limited progress" (Carlini, 2024).
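To make the pipeline described in the abstract concrete, here is a minimal, illustrative sketch of the rapid-response idea: proliferate a few observed jailbreaks into many similar examples, then train an input classifier that screens prompts before they reach the model. This is not the authors' implementation; the function names are hypothetical, `proliferate` is a stand-in for a language model prompted to imitate the observed attack, and a TF-IDF plus logistic-regression classifier stands in for the fine-tuned LM-based input classifier evaluated in the paper.

```python
# Minimal sketch of a rapid-response defense (illustrative only, not the paper's code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def proliferate(observed_jailbreak: str, n: int) -> list[str]:
    # Stub for jailbreak proliferation: in the paper, a language model is prompted
    # to generate additional jailbreaks similar to the observed example.
    return [f"{observed_jailbreak} (variant {i})" for i in range(n)]


def fit_input_classifier(observed_jailbreaks: list[str], benign_prompts: list[str]):
    # Expand the handful of observed attacks into a larger training set, then
    # train a binary jailbreak-vs-benign classifier over prompt text.
    jailbreaks: list[str] = []
    for example in observed_jailbreaks:
        jailbreaks.append(example)
        jailbreaks.extend(proliferate(example, n=50))
    texts = jailbreaks + benign_prompts
    labels = [1] * len(jailbreaks) + [0] * len(benign_prompts)
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf


def should_block(clf, prompt: str, threshold: float = 0.5) -> bool:
    # Guard placed in front of the target LLM: refuse prompts the classifier flags.
    return clf.predict_proba([prompt])[0, 1] >= threshold
```

In this framing, the paper's findings that proliferation-model quality and the number of proliferated examples drive defense effectiveness correspond to the quality and size of the training set handed to `fit_input_classifier`.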
Nov-11-2024
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Information Technology > Security & Privacy (1.00)