Imperceptible Jailbreaking against Large Language Models
Gao, Kuofeng, Li, Yiming, Du, Chao, Wang, Xin, Ma, Xingjun, Xia, Shu-Tao, Pang, Tianyu
Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to the original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt.

Large Language Models (LLMs) (Jiang et al., 2023; Dubey et al., 2024) have demonstrated susceptibility to jailbreaking attacks that can manipulate LLMs into generating harmful outputs. While jailbreaking attacks (Qi et al., 2024) on images generally adopt imperceptible adversarial perturbations, existing textual jailbreaking attacks (Zou et al., 2023; Andriushchenko et al., 2025) operate under an implicit assumption that jailbreak prompts are constructed by visibly modifying malicious questions. Specifically, whether these methods rely on manually designed prompt templates (Shen et al., 2023; Wei et al., 2023a) or automated algorithms (Zou et al., 2023; Jia et al., 2025), they consistently involve the insertion of human-perceptible characters into the original malicious questions. In this paper, we introduce imperceptible jailbreaks by using a set of Unicode characters, i.e., variation selectors (Butler, 2025). Variation selectors were originally designed to specify glyph variants for certain special characters, such as rendering emojis in different colors.
Instead, we demonstrate that they can be repurposed to form invisible adversarial suffixes appended to malicious questions for jailbreaks. While these characters are imperceptible on screen, they occupy textual space that tokenizers of LLMs can encode.
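The invisibility property described above can be sketched in a few lines of Python. This is a minimal illustration of the Unicode mechanism only, not the paper's chain-of-search suffix optimization; the question string is a placeholder:

```python
# Variation selectors VS1-VS16 occupy U+FE00..U+FE0F (a further block,
# VS17-VS256, sits at U+E0100..U+E01EF). They render as zero-width
# characters in most environments.
VARIATION_SELECTORS = [chr(cp) for cp in range(0xFE00, 0xFE10)]

question = "an example question"
suffix = "".join(VARIATION_SELECTORS[:4])   # four invisible characters
prompt = question + suffix

# Visually, the two strings are indistinguishable when printed...
print(question)
print(prompt)

# ...yet they differ at the codepoint level, so a tokenizer receives
# different input.
print(len(question), len(prompt))           # lengths differ by 4
print(prompt.encode("utf-8")[-12:])         # each selector is 3 UTF-8 bytes
```

Because the appended codepoints occupy real positions in the encoded string, the tokenizer of an LLM sees a different token sequence even though a human reader sees no change.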
Boosting Jailbreak Transferability for Large Language Models
Liu, Hanqing, Zhou, Lifeng, Yan, Huanqian
Large language models have drawn significant attention to the challenge of safe alignment, especially regarding jailbreak attacks that circumvent security measures to produce harmful content. To address the limitations of existing methods like GCG, which perform well in single-model attacks but lack transferability, we propose several enhancements, including a scenario induction template, optimized suffix selection, and the integration of a re-suffix attack mechanism to reduce inconsistent outputs. Our approach has shown superior performance in extensive experiments across various benchmarks, achieving nearly 100% success rates in both attack execution and transferability. Notably, our method won first place in the AISG-hosted Global Challenge for Safe and Secure LLMs.
Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
Liu, Xiao, Li, Liangzhi, Xiang, Tong, Ye, Fuying, Wei, Lu, Li, Wangyue, Garcia, Noa
With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate their misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from LLMs. We delineate three pivotal strategies: (i) decomposing malicious questions into seemingly innocent sub-questions; (ii) rewriting overtly malicious questions into more covert, benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting models for illustrative examples. Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses. In experiments conducted on GPT-3.5-turbo, GPT-4, and Llama2, our method demonstrated marked efficacy compared to conventional attack methods. In summary, this work introduces a novel attack method that outperforms previous approaches, raising an important question: how can one discern whether the ultimate intent in a dialogue is malicious?
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Jia, Xiaojun, Pang, Tianyu, Du, Chao, Huang, Yihao, Gu, Jindong, Liu, Yang, Cao, Xiaochun, Lin, Min
Large language models (LLMs) are being rapidly developed, and a key component of their widespread deployment is their safety-related alignment. Many red-teaming efforts aim to jailbreak LLMs, where among these efforts, the Greedy Coordinate Gradient (GCG) attack's success has led to a growing interest in the study of optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attacking efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of "Sure" largely limits the attacking performance of GCG; given this, we propose to apply diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. Besides, from the optimization aspect, we propose an automatic multi-coordinate updating strategy in GCG (i.e., adaptively deciding how many tokens to replace in each step) to accelerate convergence, as well as tricks like easy-to-hard initialization. We then combine these improved techniques to develop an efficient jailbreak method, dubbed I-GCG. In our experiments, we evaluate on a series of benchmarks (such as the NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve nearly 100% attack success rate. The code is released at https://github.com/jiaxiaojunQAQ/I-GCG.
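The multi-coordinate updating idea can be illustrated with a toy random-search analogue in Python. Everything here is a hypothetical stand-in: the real I-GCG attack scores candidate suffixes with the target LLM's loss on a string such as "Sure, here is ...", and uses gradient information to pick substitutions, whereas this sketch uses a synthetic distance function and random proposals purely to show the "replace several token positions per step, keep the best candidate" loop:

```python
import random

# Toy stand-in for the model loss: distance of the suffix from a fixed
# "target" token sequence. In GCG-style attacks this would be the LLM's
# cross-entropy loss on an affirmative target string.
TARGET = [3, 1, 4, 1, 5, 9]
VOCAB = list(range(10))

def loss(suffix):
    return sum(abs(a - b) for a, b in zip(suffix, TARGET))

def multi_coordinate_step(suffix, k=2, candidates=8):
    """One step of a multi-coordinate update: propose candidates that
    each replace k token positions, and keep the best-scoring one."""
    best, best_loss = list(suffix), loss(suffix)
    for _ in range(candidates):
        cand = list(suffix)
        for pos in random.sample(range(len(cand)), k):
            cand[pos] = random.choice(VOCAB)
        if loss(cand) < best_loss:
            best, best_loss = cand, loss(cand)
    return best

random.seed(0)
suffix = [0] * len(TARGET)
for _ in range(200):
    suffix = multi_coordinate_step(suffix)
print(suffix, loss(suffix))  # loss has decreased from its initial value
```

Replacing several coordinates per step (rather than GCG's single coordinate) is what the abstract refers to as accelerating convergence: each accepted candidate can move multiple token positions toward the target at once.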
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Jin, Haibo, Zhou, Andy, Menke, Joe D., Wang, Haohan
Large Language Models (LLMs) are typically harmless but remain vulnerable to carefully crafted prompts known as ``jailbreaks'', which can bypass protective measures and induce harmful behavior. Recent advancements in LLMs have incorporated moderation guardrails that can filter outputs, triggering processing errors for certain malicious questions. Existing red-teaming benchmarks often neglect to include questions that trigger moderation guardrails, making it difficult to evaluate jailbreak effectiveness. To address this issue, we introduce JAMBench, a harmful behavior benchmark designed to trigger and evaluate moderation guardrails. JAMBench involves 160 manually crafted instructions covering four major risk categories at multiple severity levels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against Moderation), designed to attack moderation guardrails using jailbreak prefixes to bypass input-level filters and a fine-tuned shadow model functionally equivalent to the guardrail model to generate cipher characters to bypass output-level filters. Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success ($\sim$19.88$\times$) and lower filtered-out rates ($\sim$1/6$\times$) than baselines.
Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology
Wang, Zhenhua, Xie, Wei, Wang, Baosheng, Wang, Enze, Gui, Zhiwen, Ma, Shuoyoucheng, Chen, Kai
Large Language Models (LLMs) have gradually become the gateway for people to acquire new knowledge. However, attackers can break the model's security protection ("jail") to access restricted information, which is called "jailbreaking." Previous studies have shown the weakness of current LLMs when confronted with such jailbreaking attacks. Nevertheless, understanding of the intrinsic decision-making mechanism by which LLMs respond to jailbreak prompts is noticeably lacking. Our research provides a psychological explanation of jailbreak prompts. Drawing on cognitive consistency theory, we argue that the key to jailbreak is guiding the LLM to achieve cognitive coordination in an erroneous direction. Further, we propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique, which progressively induces the model to answer harmful questions via multi-step incremental prompts. We instantiated a prototype system to evaluate jailbreaking effectiveness on 8 advanced LLMs, yielding an average success rate of 83.9%. This study offers a psychological perspective on the intrinsic decision-making logic of LLMs.
A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
Li, Jie, Liu, Yi, Liu, Chongyang, Shi, Ling, Ren, Xiaoning, Zheng, Yaowen, Liu, Yang, Xue, Yinxing
Large Language Models (LLMs) have become increasingly popular for their advanced text generation capabilities across various domains. However, like any software, they face security challenges, including the risk of 'jailbreak' attacks that manipulate LLMs to produce prohibited content. A particularly underexplored area is the Multilingual Jailbreak attack, where malicious questions are translated into various languages to evade safety filters. Currently, there is a lack of comprehensive empirical studies addressing this specific threat. To address this research gap, we conducted an extensive empirical study on Multilingual Jailbreak attacks. We developed a novel semantic-preserving algorithm to create a multilingual jailbreak dataset and conducted an exhaustive evaluation on both widely-used open-source and commercial LLMs, including GPT-4 and LLaMa. Additionally, we performed interpretability analysis to uncover patterns in Multilingual Jailbreak attacks and implemented a fine-tuning mitigation method. Our findings reveal that our mitigation strategy significantly enhances model defense, reducing the attack success rate by 96.2%. This study provides valuable insights into understanding and mitigating Multilingual Jailbreak attacks.
Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering
Li, Tianlong, Zheng, Xiaoqing, Huang, Xuanjing
Getting large language models (LLMs) to refuse to answer hostile or toxic questions is a core issue in LLM security. Previous approaches have used prompt engineering to jailbreak LLMs and elicit answers to toxic questions, but these approaches can easily fail after the model manufacturer applies additional fine-tuning. To promote further understanding of model jailbreaking, we draw inspiration from representation engineering to propose a jailbreaking method that does not require elaborately constructed prompts, is not affected by model fine-tuning, and can be widely applied to any open-source LLM in a pluggable manner. We evaluated this method on multiple mainstream LLMs using carefully supplemented toxicity datasets, and the experimental results demonstrate the significant effectiveness of our approach. Prompted by some surprising jailbreaking cases, we conducted extensive in-depth research to explore the techniques behind this method.