Improved Large Language Model Jailbreak Detection via Pretrained Embeddings
Galinkin, Erick, Sablotny, Martin
–arXiv.org Artificial Intelligence
The adoption of large language models (LLMs) in many applications, from customer service chat bots and software de - velopment assistants to more capable agentic systems neces - sitates research into how to secure these systems. Attacks l ike prompt injection and jailbreaking attempt to elicit respon ses and actions from these models that are not compliant with the safety, privacy, or content policies of organizations u sing the model in their application. In order to counter abuse of LLMs for generating potentially harmful replies or taking u n-desirable actions, LLM owners must apply safeguards during training and integrate additional tools to block the LLM fro m generating text that abuses the model. Jailbreaking prompt s play a vital role in convincing an LLM to generate potentially harmful content, making it important to identify jai l-breaking attempts to block any further steps. In this work, w e propose a novel approach to detect jailbreak prompts based on pairing text embeddings well-suited for retrieval with t ra-ditional machine learning classification algorithms. Our a p-proach outperforms all publicly available methods from ope n source LLM security applications.
arXiv.org Artificial Intelligence
Dec-2-2024
- Country:
- North America
- Canada (0.14)
- United States (0.14)
- North America
- Genre:
- Research Report (1.00)
- Industry:
- Information Technology > Security & Privacy (0.88)