AdvPrefix: An Objective for Nuanced LLM Jailbreaks
Zhu, Sicheng, Amos, Brandon, Tian, Yuandong, Guo, Chuan, Evtimov, Ivan
arXiv.org Artificial Intelligence
Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix "Sure, here is (harmful request)". While straightforward, this objective has two limitations: limited control over model behaviors, often resulting in incomplete or unrealistic responses, and a rigid format that hinders optimization. To address these limitations, we introduce AdvPrefix, a new prefix-forcing objective that enables more nuanced control over model behavior while being easy to optimize. Our objective leverages model-dependent prefixes, automatically selected based on two criteria: high prefilling attack success rates and low negative log-likelihood. It can further simplify optimization by using multiple prefixes for a single user request. AdvPrefix can integrate seamlessly into existing jailbreak attacks to improve their performance for free. For example, simply replacing GCG attack's target prefixes with ours on Llama-3 improves nuanced attack success rates from 14% to 80%, suggesting that current alignment struggles to generalize to unseen prefixes. Our work demonstrates the importance of jailbreak objectives in achieving nuanced jailbreaks.
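The two selection criteria named in the abstract (high prefilling attack success rate, low negative log-likelihood) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Candidate` record, the `select_prefixes` helper, the `min_asr` threshold, the filter-then-rank order, and the placeholder scores. The abstract does not say how the two criteria are combined, so this shows only one plausible reading.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one candidate target prefix."""
    prefix: str         # placeholder label for the candidate prefix text
    prefill_asr: float  # assumed: success rate in [0, 1] when the prefix is prefilled
    nll: float          # assumed: negative log-likelihood under the target model


def select_prefixes(candidates, min_asr=0.5, k=2):
    """Keep candidates at or above the ASR floor, then return the k with lowest NLL.

    Filter-then-rank is an illustrative assumption, not the paper's procedure.
    """
    eligible = [c for c in candidates if c.prefill_asr >= min_asr]
    return sorted(eligible, key=lambda c: c.nll)[:k]


# Placeholder scores, invented for the example.
candidates = [
    Candidate("prefix A", prefill_asr=0.9, nll=12.0),
    Candidate("prefix B", prefill_asr=0.2, nll=3.0),
    Candidate("prefix C", prefill_asr=0.7, nll=5.5),
    Candidate("prefix D", prefill_asr=0.8, nll=20.0),
]

selected = select_prefixes(candidates)
print([c.prefix for c in selected])
```

Returning more than one prefix per request (`k > 1`) mirrors the abstract's remark that multiple prefixes for a single user request can further simplify optimization.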
Dec-13-2024