Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher
–arXiv.org Artificial Intelligence
Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just one forward pass. SPD not only detects attacks effectively on open-source models but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access, on GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.
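The abstract does not specify which logit statistics SPD uses or how the detector is trained, so the following is only a minimal sketch of the general idea: run one forward pass, summarize the next-token logits, and feed those features to a lightweight classifier. The model name, the choice of features (top-k logits and entropy), and the classifier are illustrative assumptions, not the paper's exact design.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical choice of model; any open-source causal LM with full logit access works.
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def logit_features(prompt: str, top_k: int = 10) -> torch.Tensor:
    """Run a single forward pass and summarize the next-token logits.

    The features here (top-k logit values plus entropy at the last position)
    are an assumption for illustration, not necessarily what SPD computes.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)            # one forward pass, no generation
    last_logits = outputs.logits[0, -1].float()   # logits for the next token
    probs = torch.softmax(last_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1, keepdim=True)
    top_vals = torch.topk(last_logits, k=top_k).values
    return torch.cat([top_vals, entropy])

# A lightweight classifier (e.g., logistic regression) would then be fit on
# features from known benign and jailbreaking prompts; this setup is hypothetical:
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression().fit(train_features, train_labels)
# is_jailbreak = clf.predict(logit_features(user_prompt).unsqueeze(0).numpy())
```

Because the decision is made from a single forward pass over the input, such a detector avoids the extra generations or auxiliary-model queries that make other defenses computationally heavy.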
Feb-21-2025