Imperceptible Jailbreaking against Large Language Models
Gao, Kuofeng, Li, Yiming, Du, Chao, Wang, Xin, Ma, Xingjun, Xia, Shu-Tao, Pang, Tianyu
Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to the original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt.

Large Language Models (LLMs) (Jiang et al., 2023; Dubey et al., 2024) have demonstrated susceptibility to jailbreaking attacks that can manipulate LLMs into generating harmful outputs. While jailbreaking attacks (Qi et al., 2024) on images generally adopt imperceptible adversarial perturbations, existing textual jailbreaking attacks (Zou et al., 2023; Andriushchenko et al., 2025) operate under an implicit assumption that jailbreak prompts are constructed by visibly modifying malicious questions. Specifically, whether these methods rely on manually designed prompt templates (Shen et al., 2023; Wei et al., 2023a) or automated algorithms (Zou et al., 2023; Jia et al., 2025), they consistently involve the insertion of human-perceptible characters into the original malicious questions. In this paper, we introduce imperceptible jailbreaks by using a set of Unicode characters, i.e., variation selectors (Butler, 2025). Variation selectors were originally designed to specify glyph variants for certain special characters, such as rendering emojis in different colors.
Instead, we demonstrate that they can be repurposed to form invisible adversarial suffixes appended to malicious questions for jailbreaks. While these characters are imperceptible on screen, they occupy textual space that tokenizers of LLMs can encode.
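The invisibility property described above can be sketched in a few lines of Python. This is a minimal illustration of the Unicode mechanism only, not the paper's chain-of-search suffix optimization; the question string is a placeholder:

```python
# Variation selectors VS1-VS16 occupy U+FE00..U+FE0F (a further block,
# VS17-VS256, sits at U+E0100..U+E01EF). They render as zero-width
# characters in most environments.
VARIATION_SELECTORS = [chr(cp) for cp in range(0xFE00, 0xFE10)]

question = "an example question"
suffix = "".join(VARIATION_SELECTORS[:4])   # four invisible characters
prompt = question + suffix

# Visually, the two strings are indistinguishable when printed...
print(question)
print(prompt)

# ...yet they differ at the codepoint level, so a tokenizer receives
# different input.
print(len(question), len(prompt))           # lengths differ by 4
print(prompt.encode("utf-8")[-12:])         # each selector is 3 UTF-8 bytes
```

Because the appended codepoints occupy real positions in the encoded string, the tokenizer of an LLM sees a different token sequence even though a human reader sees no change.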
Boosting Jailbreak Transferability for Large Language Models
Liu, Hanqing, Zhou, Lifeng, Yan, Huanqian
Large language models have drawn significant attention to the challenge of safe alignment, especially regarding jailbreak attacks that circumvent security measures to produce harmful content. To address the limitations of existing methods like GCG, which perform well in single-model attacks but lack transferability, we propose several enhancements, including a scenario induction template, optimized suffix selection, and the integration of a re-suffix attack mechanism to reduce inconsistent outputs. Our approach has shown superior performance in extensive experiments across various benchmarks, achieving nearly 100% success rates in both attack execution and transferability. Notably, our method won first place in the AISG-hosted Global Challenge for Safe and Secure LLMs.
Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
Liu, Xiao, Li, Liangzhi, Xiang, Tong, Ye, Fuying, Wei, Lu, Li, Wangyue, Garcia, Noa
With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate their misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from LLMs. We delineate three pivotal strategies: (i) decomposing malicious questions into seemingly innocent sub-questions; (ii) rewriting overtly malicious questions into more covert, benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting models for illustrative examples. Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses. In experiments conducted on GPT-3.5-turbo, GPT-4, and Llama2, our method demonstrated marked efficacy compared to conventional attack methods. In summary, this work introduces a novel attack method that outperforms previous approaches, raising an important question: how can one discern whether the ultimate intent in a dialogue is malicious?
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Jia, Xiaojun, Pang, Tianyu, Du, Chao, Huang, Yihao, Gu, Jindong, Liu, Yang, Cao, Xiaochun, Lin, Min
Large language models (LLMs) are being rapidly developed, and a key component of their widespread deployment is their safety-related alignment. Many red-teaming efforts aim to jailbreak LLMs, where among these efforts, the Greedy Coordinate Gradient (GCG) attack's success has led to a growing interest in the study of optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attacking efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of "Sure" largely limits the attacking performance of GCG; given this, we propose to apply diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. Besides, from the optimization aspect, we propose an automatic multi-coordinate updating strategy in GCG (i.e., adaptively deciding how many tokens to replace in each step) to accelerate convergence, as well as tricks like easy-to-hard initialization. We then combine these improved techniques to develop an efficient jailbreak method, dubbed I-GCG. In our experiments, we evaluate on a series of benchmarks (such as the NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve nearly 100% attack success rate. The code is released at https://github.com/jiaxiaojunQAQ/I-GCG.
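The multi-coordinate updating idea can be illustrated with a toy random-search analogue in Python. Everything here is a hypothetical stand-in: the real I-GCG attack scores candidate suffixes with the target LLM's loss on a string such as "Sure, here is ...", and uses gradient information to pick substitutions, whereas this sketch uses a synthetic distance function and random proposals purely to show the "replace several token positions per step, keep the best candidate" loop:

```python
import random

# Toy stand-in for the model loss: distance of the suffix from a fixed
# "target" token sequence. In GCG-style attacks this would be the LLM's
# cross-entropy loss on an affirmative target string.
TARGET = [3, 1, 4, 1, 5, 9]
VOCAB = list(range(10))

def loss(suffix):
    return sum(abs(a - b) for a, b in zip(suffix, TARGET))

def multi_coordinate_step(suffix, k=2, candidates=8):
    """One step of a multi-coordinate update: propose candidates that
    each replace k token positions, and keep the best-scoring one."""
    best, best_loss = list(suffix), loss(suffix)
    for _ in range(candidates):
        cand = list(suffix)
        for pos in random.sample(range(len(cand)), k):
            cand[pos] = random.choice(VOCAB)
        if loss(cand) < best_loss:
            best, best_loss = cand, loss(cand)
    return best

random.seed(0)
suffix = [0] * len(TARGET)
for _ in range(200):
    suffix = multi_coordinate_step(suffix)
print(suffix, loss(suffix))  # loss has decreased from its initial value
```

Replacing several coordinates per step (rather than GCG's single coordinate) is what the abstract refers to as accelerating convergence: each accepted candidate can move multiple token positions toward the target at once.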
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Jin, Haibo, Zhou, Andy, Menke, Joe D., Wang, Haohan
Large Language Models (LLMs) are typically harmless but remain vulnerable to carefully crafted prompts known as ``jailbreaks'', which can bypass protective measures and induce harmful behavior. Recent advancements in LLMs have incorporated moderation guardrails that can filter outputs, triggering processing errors for certain malicious questions. Existing red-teaming benchmarks often neglect to include questions that trigger moderation guardrails, making it difficult to evaluate jailbreak effectiveness. To address this issue, we introduce JAMBench, a harmful behavior benchmark designed to trigger and evaluate moderation guardrails. JAMBench involves 160 manually crafted instructions covering four major risk categories at multiple severity levels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against Moderation), designed to attack moderation guardrails using jailbreak prefixes to bypass input-level filters and a fine-tuned shadow model functionally equivalent to the guardrail model to generate cipher characters to bypass output-level filters. Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success ($\sim$19.88$\times$) and lower filtered-out rates ($\sim$1/6$\times$) than baselines.
Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology
Wang, Zhenhua, Xie, Wei, Wang, Baosheng, Wang, Enze, Gui, Zhiwen, Ma, Shuoyoucheng, Chen, Kai
Large Language Models (LLMs) have gradually become the gateway for people to acquire new knowledge. However, attackers can break the model's security protection ("jail") to access restricted information, which is called "jailbreaking." Previous studies have shown the weakness of current LLMs when confronted with such jailbreaking attacks. Nevertheless, understanding of the intrinsic decision-making mechanism by which LLMs respond to jailbreak prompts is noticeably lacking. Our research provides a psychological explanation of jailbreak prompts. Drawing on cognitive consistency theory, we argue that the key to jailbreak is guiding the LLM to achieve cognitive coordination in an erroneous direction. Further, we propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique, which progressively induces the model to answer harmful questions via multi-step incremental prompts. We instantiated a prototype system to evaluate jailbreaking effectiveness on 8 advanced LLMs, yielding an average success rate of 83.9%. This study offers a psychological perspective on the intrinsic decision-making logic of LLMs.
A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
Li, Jie, Liu, Yi, Liu, Chongyang, Shi, Ling, Ren, Xiaoning, Zheng, Yaowen, Liu, Yang, Xue, Yinxing
Large Language Models (LLMs) have become increasingly popular for their advanced text generation capabilities across various domains. However, like any software, they face security challenges, including the risk of 'jailbreak' attacks that manipulate LLMs to produce prohibited content. A particularly underexplored area is the Multilingual Jailbreak attack, where malicious questions are translated into various languages to evade safety filters. Currently, there is a lack of comprehensive empirical studies addressing this specific threat. To address this research gap, we conducted an extensive empirical study on Multilingual Jailbreak attacks. We developed a novel semantic-preserving algorithm to create a multilingual jailbreak dataset and conducted an exhaustive evaluation on both widely-used open-source and commercial LLMs, including GPT-4 and LLaMa. Additionally, we performed interpretability analysis to uncover patterns in Multilingual Jailbreak attacks and implemented a fine-tuning mitigation method. Our findings reveal that our mitigation strategy significantly enhances model defense, reducing the attack success rate by 96.2%. This study provides valuable insights into understanding and mitigating Multilingual Jailbreak attacks.
Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering
Li, Tianlong, Zheng, Xiaoqing, Huang, Xuanjing
Getting large language models (LLMs) to refuse to answer hostile or toxic questions is a core issue in LLM security. Previous approaches have used prompt engineering to jailbreak LLMs and elicit answers to toxic questions, but these approaches can easily fail after the model manufacturer applies additional fine-tuning. To promote further understanding of model jailbreaking, we draw inspiration from representation engineering to propose a jailbreaking method that does not require elaborately constructed prompts, is not affected by model fine-tuning, and can be widely applied to any open-source LLM in a pluggable manner. We evaluated this method on multiple mainstream LLMs using carefully supplemented toxicity datasets, and the experimental results demonstrate the significant effectiveness of our approach. Prompted by some surprising jailbreaking cases, we conducted extensive in-depth research to explore the techniques behind this method.