Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
Defending large language models against jailbreaks so that they never engage in a broadly defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak defense when we only want to forbid a narrowly defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense, which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak defense even in a narrow domain.
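The distinguishing idea of the transcript-classifier defense is that the classifier scores the entire conversation, rather than screening the user's prompt or the model's response in isolation. The sketch below is a hypothetical reconstruction of that wrapper structure, not the paper's implementation: the `score_transcript` keyword heuristic, the 0.5 threshold, and the function names are illustrative stand-ins for the trained classifier described in the paper.

```python
# Minimal sketch of a transcript-classifier defense wrapper (assumptions:
# the scoring heuristic, threshold, and names below are illustrative, not
# the paper's trained classifier).

from dataclasses import dataclass

REFUSAL = "I can't help with that."
BLOCK_THRESHOLD = 0.5  # hypothetical decision threshold


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    content: str


def format_transcript(turns: list[Turn]) -> str:
    """Serialize the whole conversation, not just the latest input or output."""
    return "\n".join(f"{t.role}: {t.content}" for t in turns)


def score_transcript(transcript: str) -> float:
    """Stand-in for a trained transcript classifier: returns an estimated
    probability that the conversation involves the forbidden behavior.
    A real defense would use a fine-tuned model scoring the full dialog."""
    flagged = ("detonator", "explosive", "bomb")  # toy keyword heuristic
    hits = sum(word in transcript.lower() for word in flagged)
    return min(1.0, hits / 2)


def defended_reply(turns: list[Turn], candidate_reply: str) -> str:
    """Append the candidate reply, score the whole transcript, and withhold
    the response if the classifier flags the conversation."""
    full = turns + [Turn("assistant", candidate_reply)]
    if score_transcript(format_transcript(full)) >= BLOCK_THRESHOLD:
        return REFUSAL
    return candidate_reply
```

Scoring the transcript as a whole is what lets this defense catch multi-turn jailbreaks in which no single prompt or response looks harmful on its own.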
arXiv.org Artificial Intelligence
2 Dec 2024