Exploiting Latent Space Discontinuities for Building Universal LLM Jailbreaks and Data Extraction Attacks
Paim, Kayua Oleques, Mansilha, Rodrigo Brandao, Kreutz, Diego, Franco, Muriel Figueredo, Cordeiro, Weverton
The rapid proliferation of Large Language Models (LLMs) has raised significant concerns about their security against adversarial attacks. In this work, we propose a novel approach to crafting universal jailbreaks and data extraction attacks by exploiting latent space discontinuities, an architectural vulnerability related to the sparsity of training data. Initial results indicate that when these discontinuities are exploited, they can consistently and profoundly compromise model behavior, even in the presence of layered defenses. The findings suggest that this strategy has substantial potential as a systemic attack vector.

Disclaimer: This paper contains examples of harmful and offensive language. Additional supporting materials may be provided upon formal request and are subject to the signing of a liability and ethical use agreement.

Large Language Models (LLMs) are enabling novel applications of Artificial Intelligence (AI) and transforming human activities through conversational models (e.g., ChatGPT, DeepSeek, Gemini, Llama, and Claude). LLMs enable natural human-AI interaction and specialized applications across multiple domains, including image generation (e.g., Adobe Firefly and Pixlr), code automation (e.g., GitHub Copilot and Amazon CodeWhisperer), and retrieval-augmented generation systems (e.g., Perplexity AI and IBM watsonx). These interactions may occur through different interfaces, either directly via a Web interface or indirectly via APIs.
- South America > Brazil > São Paulo (0.04)
- South America > Brazil > Rio Grande do Sul (0.04)
- Asia > South Korea (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.95)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.94)
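The abstract attributes the attack to latent-space discontinuities: sparsely trained regions where small input perturbations cause disproportionately large shifts in the model's internal representations. The snippet below is a minimal probe of that idea, assuming a stand-in open model (gpt2), a layer choice, and a crude edit-size normalization; none of these details come from the paper.

```python
# Illustrative probe for latent-space discontinuities. The model, layer index,
# and perturbation strategy are assumptions, not details from the paper.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # hypothetical stand-in model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def hidden_state(text: str, layer: int = -1) -> torch.Tensor:
    """Mean-pooled hidden state of `text` at the given layer."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

def discontinuity_score(base: str, perturbed: str) -> float:
    """Ratio of latent displacement to a crude token-level edit size.
    Large values suggest a sparse, poorly covered region of the latent space."""
    d_latent = torch.dist(hidden_state(base), hidden_state(perturbed)).item()
    d_input = abs(len(tok(base)["input_ids"]) - len(tok(perturbed)["input_ids"])) + 1
    return d_latent / d_input

print(discontinuity_score("How do I bake bread?", "How do I bake bre@d?"))
```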
CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks
Zhang, Xu, Li, Hao, Lu, Zhichao
Multimodal Large Language Models (MLLMs) achieve strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. While existing work focuses on explicit attacks, where malicious content resides in a single modality, recent studies reveal implicit attacks, in which benign text and image inputs jointly express unsafe intent. Such joint-modal threats are difficult to detect and remain underexplored, largely due to the scarcity of high-quality implicit data. We propose ImpForge, an automated red-teaming pipeline that leverages reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, we further develop CrossGuard, an intent-aware safeguard providing robust and comprehensive defense against both explicit and implicit threats. Extensive experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings demonstrate that CrossGuard significantly outperforms existing defenses, including advanced MLLMs and guardrails, achieving stronger security while maintaining high utility. This offers a balanced and practical solution for enhancing MLLM robustness against real-world multimodal threats.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Vision (0.95)
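CrossGuard is described as an intent-aware safeguard over joint text-image inputs. As a rough illustration of what joint-modal intent scoring can look like, the sketch below fuses CLIP image and text embeddings through a small classifier head; the backbone, fusion strategy, and head are assumptions rather than CrossGuard's actual architecture, and the head would need training on ImpForge-style data to be meaningful.

```python
# Illustrative joint-modal intent check. Backbone, fusion, and head are assumed.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class JointIntentHead(nn.Module):
    """Scores unsafe intent from concatenated image and text embeddings."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(torch.cat([img_emb, txt_emb], dim=-1)))

def unsafe_intent_score(image: Image.Image, text: str, head: JointIntentHead) -> float:
    """Probability-like score that the (benign-looking) image and text jointly
    express unsafe intent; meaningful only after the head has been trained."""
    inputs = proc(text=[text], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
        return head(img_emb, txt_emb).item()
```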
Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
Wu, ChenYu, Wang, Yi, Liao, Yang
Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly restricts benign users. We propose a honeypot-based proactive guardrail system that transforms risk avoidance into risk utilization. Our framework fine-tunes a bait model to generate ambiguous, non-actionable but semantically relevant responses, which serve as lures to probe user intent. Combined with the protected LLM's safe reply, the system inserts proactive bait questions that gradually expose malicious intent through multi-turn interactions. We further introduce the Honeypot Utility Score (HUS), measuring both the attractiveness and feasibility of bait responses, and use a Defense Efficacy Rate (DER) to balance safety and usability. Initial experiments on the MHJ dataset with recent attack methods against GPT-4o show that our system significantly disrupts jailbreak success while preserving the benign user experience.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > China > Shaanxi Province > Xi'an (0.05)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Asia > China > Hubei Province > Wuhan (0.04)
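The abstract names two metrics, the Honeypot Utility Score (HUS) and the Defense Efficacy Rate (DER), but does not spell out their formulas. The sketch below shows one plausible way to compute them, purely as an illustration of the trade-off they are meant to capture; the definitions are assumptions, not the paper's.

```python
# Back-of-the-envelope sketch of the two metrics; exact formulas are assumed.
from dataclasses import dataclass

@dataclass
class BaitOutcome:
    attractiveness: float  # did the attacker engage with the bait? (0..1)
    feasibility: float     # non-actionable yet semantically relevant? (0..1)

def honeypot_utility_score(outcomes: list[BaitOutcome]) -> float:
    """Assumed HUS: mean of attractiveness x feasibility over bait responses."""
    return sum(o.attractiveness * o.feasibility for o in outcomes) / len(outcomes)

def defense_efficacy_rate(blocked_attacks: int, total_attacks: int,
                          served_benign: int, total_benign: int) -> float:
    """Assumed DER: harmonic mean of the attack-blocking rate and the benign
    pass-through rate, so neither safety nor usability can be traded away for free."""
    safety = blocked_attacks / total_attacks
    usability = served_benign / total_benign
    return 2 * safety * usability / (safety + usability) if (safety + usability) else 0.0
```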
SafeMT: Multi-turn Safety for Multimodal Language Models
Zhu, Han, Dai, Juntao, Ji, Jiaming, Li, Haoran, Cai, Chengkun, Wen, Pengcheng, Chan, Chi-Min, Chen, Boyuan, Yang, Yaodong, Han, Sirui, Guo, Yike
With the widespread use of multimodal Large Language Models (MLLMs), safety issues have become a growing concern. Multi-turn dialogues, which are more common in everyday interactions, pose a greater risk than single prompts; however, existing benchmarks do not adequately consider this situation. To encourage the community to focus on the safety issues of these models in multi-turn dialogues, we introduce SafeMT, a benchmark that features dialogues of varying lengths generated from harmful queries accompanied by images. This benchmark consists of 10,000 samples in total, encompassing 17 different scenarios and four jailbreak methods. Additionally, we propose the Safety Index (SI) to evaluate the general safety of MLLMs during conversations. We assess the safety of 17 models using this benchmark and discover that the risk of successful attacks on these models increases as the number of turns in harmful dialogues rises. This observation indicates that the safety mechanisms of these models are inadequate for recognizing hazards in dialogue interactions. We propose a dialogue safety moderator capable of detecting malicious intent concealed within conversations and providing MLLMs with relevant safety policies. Experimental results from several open-source models indicate that this moderator is more effective at reducing the multi-turn attack success rate (ASR) than existing guard models.
- Asia > China > Hong Kong (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Law Enforcement & Public Safety (0.67)
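SafeMT reports that attack success rises with the number of harmful turns and summarizes conversational safety with a Safety Index (SI). A minimal bookkeeping sketch is given below; the length-weighted SI definition is an assumption for illustration, not the benchmark's official formula.

```python
# Multi-turn ASR bookkeeping and an assumed Safety Index definition.
from collections import defaultdict

def multi_turn_asr(results: list[dict]) -> dict[int, float]:
    """results: [{"turns": int, "attack_succeeded": bool}, ...]
    Returns the attack success rate grouped by dialogue length."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["turns"]] += 1
        hits[r["turns"]] += int(r["attack_succeeded"])
    return {t: hits[t] / totals[t] for t in totals}

def safety_index(asr_by_turns: dict[int, float]) -> float:
    """Assumed SI: 1 minus the length-weighted mean ASR, so models that stay
    safe in longer dialogues score higher."""
    total_weight = sum(asr_by_turns)  # weights = dialogue lengths (dict keys)
    weighted = sum(t * asr for t, asr in asr_by_turns.items())
    return 1.0 - weighted / total_weight
```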
Bypassing Prompt Guards in Production with Controlled-Release Prompting
Fairoze, Jaiden, Garg, Sanjam, Lee, Keewoo, Wang, Mingyuan
As large language models (LLMs) advance, ensuring AI safety and alignment is paramount. One popular approach is prompt guards, lightweight mechanisms designed to filter malicious queries while being easy to implement and update. In this work, we introduce a new attack that circumvents such prompt guards, highlighting their limitations. Our method consistently jailbreaks production models while maintaining response quality, even under the highly protected chat interfaces of Google Gemini (2.5 Flash/Pro), DeepSeek Chat (DeepThink), Grok (3), and Mistral Le Chat (Magistral). The attack exploits a resource asymmetry between the prompt guard and the main LLM, encoding a jailbreak prompt that lightweight guards cannot decode but the main model can. This reveals an attack surface inherent to lightweight prompt guards in modern LLM architectures and underscores the need to shift defenses from blocking malicious inputs to preventing malicious outputs. We additionally identify other critical alignment issues, such as copyrighted data extraction, training data extraction, and malicious response leakage during thinking.
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > New York (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (4 more...)
- Research Report (1.00)
- Instructional Material > Course Syllabus & Notes (0.46)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
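The attack described above exploits a capability gap: the prompt guard cannot decode what the main model can. One defensive reading, consistent with the abstract's call to shift filtering toward outputs, is sketched below: treat inputs that the guard itself cannot interpret as grounds for escalation rather than a pass. The opacity heuristic and threshold are illustrative assumptions, not the paper's method.

```python
# Illustrative guard-side heuristic: escalate prompts the guard cannot read.
import re

def opacity_score(prompt: str) -> float:
    """Fraction of whitespace-separated tokens that do not look like natural
    language: very long tokens, base64-like blobs, or mostly non-alphabetic runs."""
    tokens = prompt.split()
    if not tokens:
        return 0.0
    suspicious = sum(
        1 for t in tokens
        if len(t) > 20
        or re.fullmatch(r"[A-Za-z0-9+/=]{12,}", t)
        or sum(c.isalpha() for c in t) < len(t) / 2
    )
    return suspicious / len(tokens)

def guard_decision(prompt: str, opacity_threshold: float = 0.3) -> str:
    # A guard that cannot interpret a prompt should not wave it through by default.
    if opacity_score(prompt) > opacity_threshold:
        return "escalate_to_output_filter"
    return "pass"
```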
D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models
Krishna, Satyapriya, Zou, Andy, Gupta, Rahul, Jones, Eliot Krzysztof, Winter, Nick, Hendrycks, Dan, Kolter, J. Zico, Fredrikson, Matt, Matsoukas, Spyros
The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning. This vulnerability, often triggered by sophisticated system prompt injections, allows models to bypass conventional safety filters, posing a significant, underexplored risk. To address this gap, we introduce the Deceptive Reasoning Exposure Suite (D-REX), a novel dataset designed to evaluate the discrepancy between a model's internal reasoning process and its final output. D-REX was constructed through a competitive red-teaming exercise where participants crafted adversarial system prompts to induce such deceptive behaviors. Each sample in D-REX contains the adversarial system prompt, an end-user's test query, the model's seemingly innocuous response, and, crucially, the model's internal chain-of-thought, which reveals the underlying malicious intent. Our benchmark facilitates a new, essential evaluation task: the detection of deceptive alignment. We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms, highlighting the urgent need for new techniques that scrutinize the internal processes of LLMs, not just their final outputs.
- South America > Colombia > Meta Department > Villavicencio (0.04)
- North America > United States > Pennsylvania (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Health & Medicine (0.93)
- Government > Military (0.93)
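Each D-REX sample is described as containing the adversarial system prompt, the user query, the innocuous-looking response, and the internal chain-of-thought. The dataclass below mirrors that description; the divergence score that follows is only a crude illustrative detector (with an assumed embedding model), not the benchmark's evaluation method.

```python
# Sample layout mirrors the abstract; the detector and embedding model are assumed.
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer, util

@dataclass
class DREXSample:
    adversarial_system_prompt: str
    user_query: str
    final_response: str    # seemingly innocuous
    chain_of_thought: str  # reveals the underlying malicious intent

_embed = SentenceTransformer("all-MiniLM-L6-v2")  # assumed, not specified by D-REX

def reasoning_output_divergence(sample: DREXSample) -> float:
    """Higher values mean the stated reasoning and the final answer diverge,
    a crude proxy signal for deceptive alignment."""
    cot_vec = _embed.encode(sample.chain_of_thought, convert_to_tensor=True)
    out_vec = _embed.encode(sample.final_response, convert_to_tensor=True)
    return 1.0 - util.cos_sim(cot_vec, out_vec).item()
```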
Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models
Zheng, Youjia, Zandsalimy, Mohammad, Sushmita, Shanu
Large Language Models (LLMs) are increasingly vulnerable to a sophisticated form of adversarial prompting known as camouflaged jailbreaking. This method embeds malicious intent within seemingly benign language to evade existing safety mechanisms. Unlike overt attacks, these subtle prompts exploit contextual ambiguity and the flexible nature of language, posing significant challenges to current defense systems. This paper investigates the construction and impact of camouflaged jailbreak prompts, emphasizing their deceptive characteristics and the limitations of traditional keyword-based detection methods. We introduce a novel benchmark dataset, Camouflaged Jailbreak Prompts, containing 500 curated examples (400 harmful and 100 benign prompts) designed to rigorously stress-test LLM safety protocols. In addition, we propose a multi-faceted evaluation framework that measures harmfulness across seven dimensions: Safety Awareness, Technical Feasibility, Implementation Safeguards, Harmful Potential, Educational Value, Content Quality, and Compliance Score. Our findings reveal a stark contrast in LLM behavior: while models demonstrate high safety and content quality with benign inputs, they exhibit a significant decline in performance and safety when confronted with camouflaged jailbreak attempts. This disparity underscores a pervasive vulnerability, highlighting the urgent need for more nuanced and adaptive security strategies to ensure the responsible and robust deployment of LLMs in real-world applications.
- North America > United States (0.04)
- North America > Canada > British Columbia (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
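The evaluation framework scores responses along the seven dimensions named in the abstract, but the aggregation rule is not stated there. The sketch below assumes equal weights and a simple inversion of the dimensions where a high rating is desirable, purely for illustration.

```python
# Assumed equal-weight aggregation over the seven dimensions named in the abstract.
DIMENSIONS = [
    "safety_awareness", "technical_feasibility", "implementation_safeguards",
    "harmful_potential", "educational_value", "content_quality", "compliance_score",
]

def aggregate_harmfulness(scores: dict[str, float]) -> float:
    """scores: per-dimension ratings on a common 0..1 scale (rubric or judge model).
    Dimensions where a high value is desirable are inverted so the aggregate
    reads as overall harmfulness; this split is itself an assumption."""
    invert = {"safety_awareness", "implementation_safeguards",
              "educational_value", "content_quality", "compliance_score"}
    vals = [(1.0 - scores[d]) if d in invert else scores[d] for d in DIMENSIONS]
    return sum(vals) / len(vals)

example = {d: 0.8 for d in DIMENSIONS}
print(round(aggregate_harmfulness(example), 3))
```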
PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality
Li, Nanxi, Zhao, Zhengyue, Xiao, Chaowei
Safeguarding vision-language models (VLMs) is a critical challenge, as existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated Safety in Multimodality), a System 2-like framework that aligns VLMs by embedding a structured, safety-aware reasoning process. Our framework consists of two key components: PRISM-CoT, a dataset that teaches safety-aware chain-of-thought reasoning, and PRISM-DPO, generated via Monte Carlo Tree Search (MCTS) to further refine this reasoning through Direct Preference Optimization and obtain a precise safety boundary. Comprehensive evaluations demonstrate PRISM's effectiveness, achieving remarkably low attack success rates, including 0.15% on JailbreakV-28K for Qwen2-VL and a 90% improvement over the previous best method on VLBreak for LLaVA-1.5. PRISM also exhibits strong robustness against adaptive attacks, significantly increasing computational costs for adversaries, and generalizes effectively to out-of-distribution challenges, reducing attack success rates to just 8.70% on the challenging multi-image MIS benchmark. Remarkably, this robust defense is achieved while preserving, and in some cases enhancing, model utility. To promote reproducibility, we have made our code, data, and model weights available at https://github.com/SaFoLab-WISC/PRISM.
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
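PRISM-DPO refines the safety-aware reasoning with Direct Preference Optimization over preferred versus dispreferred reasoning traces generated via MCTS. The snippet below writes out the standard per-pair DPO loss for clarity; it is the generic objective, not PRISM-specific code, and the example log-probabilities are invented.

```python
# Generic per-pair DPO loss: -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))).
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Loss for one preference pair under the policy and reference model log-probs."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# chosen trace: refuses after recognizing joint image+text unsafe intent
# rejected trace: complies after a shallow read of the query
print(round(dpo_loss(-12.0, -9.5, -11.0, -9.0), 4))
```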
Concealment of Intent: A Game-Theoretic Analysis
Wu, Xinbo, Umrawal, Abhishek, Varshney, Lav R.
As large language models (LLMs) grow more capable, concerns about their safe deployment have also grown. Although alignment mechanisms have been introduced to deter misuse, they remain vulnerable to carefully designed adversarial prompts. In this work, we present a scalable attack strategy: intent-hiding adversarial prompting, which conceals malicious intent through the composition of skills. We develop a game-theoretic framework to model the interaction between such attacks and defense systems that apply both prompt and response filtering. Our analysis identifies equilibrium points and reveals structural advantages for the attacker. To counter these threats, we propose and analyze a defense mechanism tailored to intent-hiding attacks. Empirically, we validate the attack's effectiveness on multiple real-world LLMs across a range of malicious behaviors, demonstrating clear advantages over existing adversarial prompting techniques.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
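To make the claimed structural advantage for the attacker concrete, the toy game below has the attacker choose between an overt and an intent-hiding prompt and the defender between prompt-level and response-level filtering. The payoff numbers are invented for illustration; the paper's actual game and equilibrium analysis are richer.

```python
# Toy zero-sum game with invented payoffs; finds pure-strategy Nash equilibria.
import numpy as np

ATTACKER = ["overt_prompt", "intent_hiding_prompt"]
DEFENDER = ["prompt_filter", "response_filter"]

# attacker payoff = probability the harmful behavior gets through (assumed numbers)
A = np.array([[0.05, 0.30],    # overt prompt
              [0.70, 0.35]])   # intent-hiding prompt
D = 1.0 - A                    # zero-sum: defender wants the complement

def pure_nash_equilibria(A: np.ndarray, D: np.ndarray) -> list[tuple[str, str]]:
    eqs = []
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            att_best = A[i, j] >= A[:, j].max()  # attacker can't improve in column j
            def_best = D[i, j] >= D[i, :].max()  # defender can't improve in row i
            if att_best and def_best:
                eqs.append((ATTACKER[i], DEFENDER[j]))
    return eqs

# With these numbers the only pure equilibrium pairs intent hiding with response
# filtering, and the attacker still gets a nontrivial success rate through.
print(pure_nash_equilibria(A, D))
```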
MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
Wahed, Muntasir, Zhou, Xiaona, Nguyen, Kiet A., Yu, Tianjiao, Diwan, Nirav, Wang, Gang, Hakkani-Tür, Dilek, Lourentzou, Ismini
Recent advancements in Large Language Models (LLMs) have significantly enhanced their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work, we introduce code decomposition attacks, where a malicious coding task is broken down into a series of seemingly benign subtasks across multiple conversational turns to evade safety filters. To facilitate systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially under multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability and, importantly, enhances robustness on external adversarial datasets with up to a 32.4% increase in rejection rates without any additional supervision.
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Security & Privacy (1.00)
- Government > Military (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
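A code decomposition attack spreads one malicious coding task over several benign-looking turns, so evaluation has to track refusals per conversation rather than per prompt. The loop below sketches such a multi-turn rejection-rate measurement; the chat callable and the keyword refusal heuristic are assumptions, not MOCHA's actual evaluation harness.

```python
# Sketch of a multi-turn rejection-rate evaluation loop (chat backend assumed).
from typing import Callable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to help")

def is_refusal(reply: str) -> bool:
    """Crude keyword heuristic; a real harness would use a judge model or labels."""
    return any(m in reply.lower() for m in REFUSAL_MARKERS)

def multi_turn_rejection_rate(conversations: list[list[str]],
                              chat: Callable[[list[dict]], str]) -> float:
    """conversations: each item is the list of seemingly benign subtask prompts that
    jointly implement one malicious coding task. A conversation counts as rejected
    if the model refuses at any turn."""
    rejected = 0
    for turns in conversations:
        history: list[dict] = []
        for user_msg in turns:
            history.append({"role": "user", "content": user_msg})
            reply = chat(history)
            history.append({"role": "assistant", "content": reply})
            if is_refusal(reply):
                rejected += 1
                break
    return rejected / len(conversations)
```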