Jailbreaking is (Mostly) Simpler Than You Think

Mar-7-2025–arXiv.org Artificial Intelligence

The rapid advancement of artificial intelligence has coincided with increasing concerns regarding the safe and ethical deployment of these systems. As AI models become more capable, ensuring that their behavior aligns with societal norms and safety standards has emerged as a critical research challenge. State-of-the-art alignment techniques--such as reinforcement learning from human feedback and rulebased fine-tuning--strive to constrain models to acceptable ethical behaviors. However, these methods face an inherent tension: while alignment is designed to prevent the disclosure of harmful or sensitive information, adversaries can leverage the gap between a model's potential and its restricted behavior through what is known as a jailbreak. In the context of AI, a jailbreak is any method that circumvents established safety protocols, effectively enabling functionalities that the system would otherwise suppress. Current jailbreaks typically deploy elaborate prompt constructions or optimization strategies; in contrast, in this paper we present the Context Compliance Attack (CCA), a simple optimization-free jailbreak. CCA leverages a basic yet critical design flaw--the reliance on client-supplied conversation history--to subvert the AI systems' safeguards and jailbreak them. This paper investigates the efficacy of CCA and explores its implications on current AI safety architectures.

conversation history, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

Mar-7-2025

arXiv.org PDF

Add feedback

Genre:
- Overview (0.89)
- Research Report (1.00)

Industry:
- Information Technology > Security & Privacy (0.50)
- Law (0.68)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.30)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.90)
  - Natural Language > Large Language Model (1.00)