Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
Defending large language models against jailbreaks so that they never engage in a broadly defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak defense when we only want to forbid a narrowly defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense, which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak defense even in a narrow domain.
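The distinguishing idea of the transcript-classifier defense is that the classifier scores the entire conversation, rather than screening the user's prompt or the model's response in isolation. The sketch below is a hypothetical reconstruction of that wrapper structure, not the paper's implementation: the `score_transcript` keyword heuristic, the 0.5 threshold, and the function names are illustrative stand-ins for the trained classifier described in the paper.

```python
# Minimal sketch of a transcript-classifier defense wrapper (assumptions:
# the scoring heuristic, threshold, and names below are illustrative, not
# the paper's trained classifier).

from dataclasses import dataclass

REFUSAL = "I can't help with that."
BLOCK_THRESHOLD = 0.5  # hypothetical decision threshold


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    content: str


def format_transcript(turns: list[Turn]) -> str:
    """Serialize the whole conversation, not just the latest input or output."""
    return "\n".join(f"{t.role}: {t.content}" for t in turns)


def score_transcript(transcript: str) -> float:
    """Stand-in for a trained transcript classifier: returns an estimated
    probability that the conversation involves the forbidden behavior.
    A real defense would use a fine-tuned model scoring the full dialog."""
    flagged = ("detonator", "explosive", "bomb")  # toy keyword heuristic
    hits = sum(word in transcript.lower() for word in flagged)
    return min(1.0, hits / 2)


def defended_reply(turns: list[Turn], candidate_reply: str) -> str:
    """Append the candidate reply, score the whole transcript, and withhold
    the response if the classifier flags the conversation."""
    full = turns + [Turn("assistant", candidate_reply)]
    if score_transcript(format_transcript(full)) >= BLOCK_THRESHOLD:
        return REFUSAL
    return candidate_reply
```

Scoring the transcript as a whole is what lets this defense catch multi-turn jailbreaks in which no single prompt or response looks harmful on its own.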
arXiv.org Artificial Intelligence
2 Dec 2024