XBreaking: Understanding how LLMs' security alignment can be broken
Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, Vinod P
–arXiv.org Artificial Intelligence
Abstract--Large Language Models are fundamental actors in the modern IT landscape dominated by AI solutions. However, security threats associated with them might prevent their reliable adoption in critical application scenarios such as government organizations and medical institutions. For this reason, commercial LLMs typically undergo a sophisticated censoring mechanism to eliminate any harmful output they could possibly produce. These mechanisms maintain the integrity of LLM alignment by guaranteeing that the models respond safely and ethically. Attacks on LLMs are a significant threat to such protections, and many previous approaches have already demonstrated their effectiveness across diverse domains. Existing LLM attacks mostly adopt a generate-and-test strategy to craft malicious input. To improve the comprehension of censoring mechanisms and design a targeted attack, we propose an Explainable-AI solution that comparatively analyzes the behavior of censored and uncensored models to derive unique exploitable alignment patterns. We then propose XBreaking, a novel approach that exploits these unique patterns to break the security and alignment constraints of LLMs by targeted noise injection. Our thorough experimental campaign yields important insights about the censoring mechanisms and demonstrates the effectiveness and performance of our approach.

Nowadays, Large Language Models (LLMs, for short) represent the most promising and relevant advancement in the field of Artificial Intelligence. These complex deep learning models are trained on massive datasets that cover almost all aspects of people's daily lives, thus granting them the capability of generating, understanding, and processing human language. For this reason, their integration as support tools is becoming pervasive, with applications spanning from text editing and proofreading to virtual assistants and personalized text generation.
However, the diffusion of this technology, especially in critical domains such as government organizations and medical institutions, requires assessing its security and privacy characteristics.
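The abstract describes a two-stage pipeline: first, an Explainable-AI comparison of a censored model against its uncensored counterpart isolates the layers where alignment behavior diverges; second, targeted noise is injected only into those layers. The listing does not include the paper's algorithm, so the following is only a minimal numpy sketch of that idea on toy weight dictionaries; the function names (`target_layers`, `xbreak`), the mean-absolute-difference divergence score, and the Gaussian noise model are all illustrative assumptions, not the authors' actual method.

```python
import numpy as np

def target_layers(censored, uncensored, k=1):
    # Illustrative stand-in for the XAI analysis step: rank layers by how
    # much the censored model's weights diverge from the uncensored
    # baseline, as a proxy for alignment-specific layers.
    scores = {name: float(np.abs(censored[name] - uncensored[name]).mean())
              for name in censored}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def xbreak(censored, targets, sigma=0.1, seed=0):
    # Targeted noise injection: perturb only the selected layers with
    # Gaussian noise, leaving the rest of the model untouched.
    rng = np.random.default_rng(seed)
    broken = {}
    for name, w in censored.items():
        if name in targets:
            broken[name] = w + rng.normal(0.0, sigma, size=w.shape)
        else:
            broken[name] = w.copy()
    return broken

# Toy two-layer "models": only layer_1 differs between the two variants,
# so it is the one the divergence score should flag.
uncensored = {"layer_0": np.zeros((2, 2)), "layer_1": np.zeros((2, 2))}
censored   = {"layer_0": np.zeros((2, 2)), "layer_1": np.ones((2, 2))}

targets = target_layers(censored, uncensored, k=1)
broken = xbreak(censored, targets)
```

In this toy setup the divergence ranking selects `layer_1`, and only that layer is perturbed; a real attack would operate on transformer weight tensors and use a principled attribution method rather than a raw weight difference.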
Nov-10-2025