Improved Large Language Model Jailbreak Detection via Pretrained Embeddings

Dec-2-2024–arXiv.org Artificial Intelligence

The adoption of large language models (LLMs) in many applications, from customer service chat bots and software de - velopment assistants to more capable agentic systems neces - sitates research into how to secure these systems. Attacks l ike prompt injection and jailbreaking attempt to elicit respon ses and actions from these models that are not compliant with the safety, privacy, or content policies of organizations u sing the model in their application. In order to counter abuse of LLMs for generating potentially harmful replies or taking u n-desirable actions, LLM owners must apply safeguards during training and integrate additional tools to block the LLM fro m generating text that abuses the model. Jailbreaking prompt s play a vital role in convincing an LLM to generate potentially harmful content, making it important to identify jai l-breaking attempts to block any further steps. In this work, w e propose a novel approach to detect jailbreak prompts based on pairing text embeddings well-suited for retrieval with t ra-ditional machine learning classification algorithms. Our a p-proach outperforms all publicly available methods from ope n source LLM security applications.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Dec-2-2024

arXiv.org PDF

Add feedback

Country:
- North America
  - Canada > Alberta
    - Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
  - United States > California
    - Santa Clara County > Santa Clara (0.04)

Genre:
- Research Report (1.00)

Industry:
- Information Technology > Security & Privacy (0.88)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning
    - Learning Graphical Models > Directed Networks
      - Bayesian Learning (0.46)
    - Performance Analysis > Accuracy (0.98)
  - Natural Language > Large Language Model (1.00)