ttack
HauntAttack: When Attack Follows Reasoning as a Shadow
Ma, Jingyuan, Li, Rui, Li, Zheng, Liu, Junfeng, Xia, Heming, Sha, Lei, Sui, Zhifang
Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing remarkable capabilities. However, the enhancement of reasoning abilities and the exposure of internal reasoning processes introduce new safety vulnerabilities. A critical question arises: when reasoning becomes intertwined with harmfulness, will LRMs become more vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce HauntAttack, a novel and general-purpose black-box adversarial attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we modify key reasoning conditions in existing questions with harmful instructions, thereby constructing a reasoning pathway that guides the model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs and observe an average attack success rate of 70\%, achieving up to 12 percentage points of absolute improvement over the strongest prior baseline. Our further analysis reveals that even advanced safety-aligned models remain highly susceptible to reasoning-based attacks, offering insights into the urgent challenge of balancing reasoning capability and safety in future model development.
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom > England (0.04)
- Europe > France (0.04)
- (4 more...)
- Research Report > New Finding (1.00)
- Overview (0.93)
- Law Enforcement & Public Safety (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Government > Military (0.88)
A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning
Gluch, Greg, Goldwasser, Shafi
In this paper, we initiate a cryptographically inspired theoretical study of detection versus mitigation of adversarial inputs produced by attackers on Machine Learning algorithms during inference time. We formally define defense by detection (DbD) and defense by mitigation (DbM). Our definitions come in the form of a 3-round protocol between two resource-bounded parties: a trainer/defender and an attacker. The attacker aims to produce inference-time inputs that fool the training algorithm. We define correctness, completeness, and soundness properties to capture successful defense at inference time while not degrading (too much) the performance of the algorithm on inputs from the training distribution. We first show that achieving DbD and achieving DbM are equivalent for ML classification tasks. Surprisingly, this is not the case for ML generative learning tasks, where there are many possible correct outputs for each input. We show a separation between DbD and DbM by exhibiting two generative learning tasks for which it is possible to defend by mitigation but it is provably impossible to defend by detection. The mitigation phase uses significantly less computational resources than the initial training algorithm. In the first learning task we consider sample complexity as the resource and in the second the time complexity. The first result holds under the assumption that the Identity-Based Fully Homomorphic Encryption (IB-FHE), publicly-verifiable zero-knowledge Succinct Non-Interactive Arguments of Knowledge (zk-SNARK), and Strongly Unforgeable Signatures exist. The second result assumes the existence of Non-Parallelizing Languages with Average-Case Hardness (NPL) and Incrementally-Verifiable Computation (IVC) and IB-FHE.
- North America > Canada > British Columbia > Vancouver (0.04)
- North America > United States > California (0.04)
- North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
Cognitive Overload Attack:Prompt Injection for Long Context
Upadhayay, Bibek, Behzadan, Vahid, Karbasi, Amin
Large Language Models (LLMs) have demonstrated remarkable capabilities in performing tasks across various domains without needing explicit retraining. This capability, known as In-Context Learning (ICL), while impressive, exposes LLMs to a variety of adversarial prompts and jailbreaks that manipulate safety-trained LLMs into generating undesired or harmful output. In this paper, we propose a novel interpretation of ICL in LLMs through the lens of cognitive neuroscience, by drawing parallels between learning in human cognition with ICL. We applied the principles of Cognitive Load Theory in LLMs and empirically validate that similar to human cognition, LLMs also suffer from cognitive overload a state where the demand on cognitive processing exceeds the available capacity of the model, leading to potential errors. Furthermore, we demonstrated how an attacker can exploit ICL to jailbreak LLMs through deliberately designed prompts that induce cognitive overload on LLMs, thereby compromising the safety mechanisms of LLMs. We empirically validate this threat model by crafting various cognitive overload prompts and show that advanced models such as GPT-4, Claude-3.5 Sonnet, Claude-3 OPUS, Llama-3-70B-Instruct, Gemini-1.0-Pro, and Gemini-1.5-Pro can be successfully jailbroken, with attack success rates of up to 99.99%. Our findings highlight critical vulnerabilities in LLMs and underscore the urgency of developing robust safeguards. We propose integrating insights from cognitive load theory into the design and evaluation of LLMs to better anticipate and mitigate the risks of adversarial attacks. By expanding our experiments to encompass a broader range of models and by highlighting vulnerabilities in LLMs' ICL, we aim to ensure the development of safer and more reliable AI systems.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Connecticut > New Haven County > West Haven (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.85)
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
Jiang, Yifan, Aggarwal, Kriti, Laud, Tanmay, Munir, Kashif, Pujara, Jay, Mukherjee, Subhabrata
The rapid progress of Large Language Models (LLMs) has opened up new opportunities across various domains and applications; yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming has been employed as a proactive security measure to probe language models for harmful outputs via jailbreak attacks. However, current jailbreak attack approaches are single-turn with explicit malicious queries that do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions in a more covert manner. To bridge this gap, we, first, propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, reaching 87.62% attack success rate on GPT-4o and 75.4% on Llama3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model's performance across standard benchmarks. Full implementation and dataset are publicly accessible at https://github.com/kriti-hippo/red_queen.
- North America > United States > California (0.14)
- North America > Mexico (0.04)
TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models
Li, Zelin, Chen, Kehai, Liu, Lemao, Bai, Xuefeng, Yang, Mingming, Xiang, Yang, Zhang, Min
With the great advancements in large language models (LLMs), adversarial attacks against LLMs have recently attracted increasing attention. We found that pre-existing adversarial attack methodologies exhibit limited transferability and are notably inefficient, particularly when applied to LLMs. In this paper, we analyze the core mechanisms of previous predominant adversarial attack methods, revealing that 1) the distributions of importance score differ markedly among victim models, restricting the transferability; 2) the sequential attack processes induces substantial time overheads. Based on the above two insights, we introduce a new scheme, named TF-Attack, for Transferable and Fast adversarial attacks on LLMs. TF-Attack employs an external LLM as a third-party overseer rather than the victim model to identify critical units within sentences. Moreover, TF-Attack introduces the concept of Importance Level, which allows for parallel substitutions of attacks. We conduct extensive experiments on 6 widely adopted benchmarks, evaluating the proposed method through both automatic and human metrics. Results show that our method consistently surpasses previous methods in transferability and delivers significant speed improvements, up to 20 times faster than earlier attack strategies.
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
BruSLeAttack: A Query-Efficient Score-Based Black-Box Sparse Adversarial Attack
Vo, Viet Quoc, Abbasnejad, Ehsan, Ranasinghe, Damith C.
We study the unique, less-well understood problem of generating sparse adversarial samples simply by observing the score-based replies to model queries. But, in contrast to query-based dense attack counterparts against black-box models, constructing sparse adversarial perturbations, even when models serve confidence score information to queries in a score-based setting, is non-trivial. Because, such an attack leads to: i) an NP-hard problem; and ii) a non-differentiable search space. Our artifacts and DIY attack samples are available on GitHub. Importantly, our work facilitates faster evaluation of model vulnerabilities and raises our vigilance on the safety, security and reliability of deployed systems. We are amidst an increasing prevalence of deep neural networks in real-world systems. So, our ability to understand the safety and security of neural networks is critical to our trust in machine intelligence. We have heightened awareness of adversarial attacks (Szegedy et al., 2014)--crafting imperceptible perturbations in inputs to manipulate deep perception systems to produce erroneous decisions. Only, access to model decisions (labels) or confidence scores are possible. Thus, crafting adversarial examples in black-box query-based interactions with a model is both interesting and practical to consider. Since confidence scores expose more information compared to model decisions, we can expect fewer queries to elicit effective attacks and, consequently, the potential for developing attacks at scale under score-based settings. While dense attacks are widely explored, the success of sparse-attacks, especially under score-based settings, has drawn much less attention and remains less understood (Croce et al., 2022). This leads to our lack of knowledge of model vulnerabilities to sparse perturbation regimes. Why are Score-Based Sparse Attacks Hard? An image with ground-truth label Minibus is misclassified as a Warplane.
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
- Transportation > Ground > Road (0.34)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
Modeling Adversarial Attack on Pre-trained Language Models as Sequential Decision Making
Fang, Xuanjie, Cheng, Sijie, Liu, Yang, Wang, Wei
Pre-trained language models (PLMs) have been widely used to underpin various downstream tasks. However, the adversarial attack task has found that PLMs are vulnerable to small perturbations. Mainstream methods adopt a detached two-stage framework to attack without considering the subsequent influence of substitution at each step. In this paper, we formally model the adversarial attack task on PLMs as a sequential decision-making problem, where the whole attack process is sequential with two decision-making problems, i.e., word finder and word substitution. Considering the attack process can only receive the final state without any direct intermediate signals, we propose to use reinforcement learning to find an appropriate sequential attack path to generate adversaries, named SDM-Attack. Extensive experimental results show that SDM-Attack achieves the highest attack success rate with a comparable modification rate and semantic similarity to attack fine-tuned BERT. Furthermore, our analyses demonstrate the generalization and transferability of SDM-Attack. The code is available at https://github.com/fduxuan/SDM-Attack.
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
Identification of Systematic Errors of Image Classifiers on Rare Subgroups
Metzen, Jan Hendrik, Hutmacher, Robin, Hua, N. Grace, Boreiko, Valentyn, Zhang, Dan
Despite excellent average-case performance of many image classifiers, their performance can substantially deteriorate on semantically coherent subgroups of the data that were under-represented in the training data. These systematic errors can impact both fairness for demographic minority groups as well as robustness and safety under domain shift. A major challenge is to identify such subgroups with subpar performance when the subgroups are not annotated and their occurrence is very rare. We leverage recent advances in text-to-image models and search in the space of textual descriptions of subgroups ("prompts") for subgroups where the target model has low performance on the prompt-conditioned synthesized data. To tackle the exponentially growing number of subgroups, we employ combinatorial testing. We denote this procedure as PromptAttack as it can be interpreted as an adversarial attack in a prompt space. We study subgroup coverage and identifiability with PromptAttack in a controlled setting and find that it identifies systematic errors with high accuracy. Thereupon, we apply PromptAttack to ImageNet classifiers and identify novel systematic errors on rare subgroups.
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- North America > United States > New York (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Africa (0.04)
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks (0.96)
- Transportation > Freight & Logistics Services (0.93)
- Information Technology (0.66)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.93)
Reevaluating Adversarial Examples in Natural Language
Morris, John X., Lifland, Eli, Lanchantin, Jack, Ji, Yangfeng, Qi, Yanjun
State-of-the-art attacks on NLP models lack a shared definition of a what constitutes a successful attack. We distill ideas from past work into a unified framework: a successful natural language adversarial example is a perturbation that fools the model and follows some linguistic constraints. We then analyze the outputs of two state-of-the-art synonym substitution attacks. We find that their perturbations often do not preserve semantics, and 38% introduce grammatical errors. Human surveys reveal that to successfully preserve semantics, we need to significantly increase the minimum cosine similarities between the embeddings of swapped words and between the sentence encodings of original and perturbed sentences.With constraints adjusted to better preserve semantics and grammaticality, the attack success rate drops by over 70 percentage points.
- North America > Canada (0.04)
- North America > United States > Virginia (0.04)
- Oceania > Australia (0.04)
- (6 more...)
- Questionnaire & Opinion Survey (0.94)
- Research Report (0.82)
- Information Technology > Security & Privacy (0.68)
- Media > Film (0.46)
- Leisure & Entertainment (0.46)
An Efficient and Margin-Approaching Zero-Confidence Adversarial Attack
Zhang, Yang, Chang, Shiyu, Yu, Mo, Qian, Kaizhi
There are two major paradigms of white-box adversarial attacks that attempt to impose input perturbations. The first paradigm, called the fix-perturbation attack, crafts adversarial samples within a given perturbation level. The second paradigm, called the zero-confidence attack, finds the smallest perturbation needed to cause mis-classification, also known as the margin of an input feature. While the former paradigm is well-resolved, the latter is not. Existing zero-confidence attacks either introduce significant ap-proximation errors, or are too time-consuming. We therefore propose MARGINATTACK, a zero-confidence attack framework that is able to compute the margin with improved accuracy and efficiency. Our experiments show that MARGINATTACK is able to compute a smaller margin than the state-of-the-art zero-confidence attacks, and matches the state-of-the-art fix-perturbation at-tacks. In addition, it runs significantly faster than the Carlini-Wagner attack, currently the most ac-curate zero-confidence attack algorithm.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- North America > United States > Illinois > Champaign County > Champaign (0.04)
- Asia (0.04)