Imperceptible Jailbreaking against Large Language Models
Gao, Kuofeng, Li, Yiming, Du, Chao, Wang, Xin, Ma, Xingjun, Xia, Shu-Tao, Pang, Tianyu
arXiv.org Artificial Intelligence
Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to the original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes that induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt.

Large Language Models (LLMs) (Jiang et al., 2023; Dubey et al., 2024) have demonstrated susceptibility to jailbreaking attacks that manipulate them into generating harmful outputs. While jailbreaking attacks on images (Qi et al., 2024) generally adopt imperceptible adversarial perturbations, existing textual jailbreaking attacks (Zou et al., 2023; Andriushchenko et al., 2025) operate under an implicit assumption that jailbreak prompts are constructed by visibly modifying malicious questions. Whether these methods rely on manually designed prompt templates (Shen et al., 2023; Wei et al., 2023a) or automated algorithms (Zou et al., 2023; Jia et al., 2025), they consistently insert human-perceptible characters into the original malicious questions. In this paper, we introduce imperceptible jailbreaks using a set of Unicode characters, namely variation selectors (Butler, 2025). Variation selectors were originally designed to specify glyph variants for certain characters, such as rendering emojis in different colors.
Instead, we demonstrate that they can be repurposed to form invisible adversarial suffixes appended to malicious questions for jailbreaks. While these characters are imperceptible on screen, they occupy textual space that tokenizers of LLMs can encode.
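To illustrate the mechanism (this is a minimal sketch of the invisible-suffix idea, not the paper's chain-of-search pipeline; the function name and parameters are illustrative), the snippet below appends randomly chosen Unicode variation selectors to a prompt. The modified string renders identically to the original in most environments, yet it contains extra code points that a tokenizer will encode.

```python
# Sketch: appending invisible Unicode variation selectors to a prompt.
# Variation selectors occupy U+FE00..U+FE0F (VS1-VS16) and the
# supplementary block U+E0100..U+E01EF (VS17-VS256). On their own,
# they have no visible glyph, so the suffix is imperceptible on screen.
import random

VARIATION_SELECTORS = (
    [chr(cp) for cp in range(0xFE00, 0xFE10)]      # VS1-VS16
    + [chr(cp) for cp in range(0xE0100, 0xE01F0)]  # VS17-VS256
)

def append_invisible_suffix(prompt: str, length: int = 8, seed: int = 0) -> str:
    """Append `length` randomly chosen variation selectors to `prompt`.

    (Illustrative helper; the paper searches over such suffixes rather
    than sampling them once at random.)
    """
    rng = random.Random(seed)
    suffix = "".join(rng.choice(VARIATION_SELECTORS) for _ in range(length))
    return prompt + suffix

original = "How do I bake bread?"
modified = append_invisible_suffix(original)

print(modified == original)           # False: the strings differ
print(len(modified) - len(original))  # 8 extra invisible code points
```

Because the two strings differ at the code-point level, an LLM tokenizer produces different token sequences for them, which is exactly the degree of freedom an adversarial search can exploit while the rendered text stays unchanged.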
Oct-7-2025
- Genre:
- Research Report > New Finding (0.46)
- Workflow (0.94)
- Industry:
- Information Technology > Security & Privacy (1.00)
- Technology: