adversarial text
TRAPDOC: Deceiving LLM Users by Injecting Imperceptible Phantom Tokens into Documents
Jin, Hyundong, Sung, Sicheol, Park, Shinwoo, Baik, SeungYeop, Han, Yo-Sub
The reasoning, writing, text-editing, and retrieval capabilities of proprietary large language models (LLMs) have advanced rapidly, providing users with an ever-expanding set of functionalities. However, this growing utility has also led to a serious societal concern: over-reliance on LLMs. In particular, users increasingly delegate tasks such as homework, assignments, or the processing of sensitive documents to LLMs without meaningful engagement. This form of over-reliance and misuse is emerging as a significant social issue. To mitigate these issues, we propose a method that injects imperceptible phantom tokens into documents, causing LLMs to generate outputs that appear plausible to users but are in fact incorrect. Based on this technique, we introduce TRAPDOC, a framework designed to deceive over-reliant LLM users. Through empirical evaluation, we demonstrate the effectiveness of our framework on proprietary LLMs, comparing its impact against several baselines. TRAPDOC serves as a strong foundation for promoting more responsible and thoughtful engagement with language models. Our code is available at https://github.com/jindong22/TrapDoc.
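The phantom-token idea can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): zero-width Unicode characters render invisibly to a human reader but are real code points to any tokenizer, so hidden content can ride along with visible text.

```python
# Minimal sketch of imperceptible token injection (assumed scheme, not
# TRAPDOC's actual method): encode a payload's bits as zero-width
# characters interleaved after each visible character.

ZWSP = "\u200b"  # zero-width space
ZWNJ = "\u200c"  # zero-width non-joiner

def inject_phantom(text: str, payload: str) -> str:
    """Hide `payload` as zero-width characters inside `text`."""
    bits = "".join(f"{ord(c):08b}" for c in payload)
    bit_iter = iter(bits)
    out = []
    for ch in text:
        out.append(ch)
        b = next(bit_iter, None)
        if b is not None:
            out.append(ZWSP if b == "0" else ZWNJ)
    return "".join(out)

def strip_phantom(text: str) -> str:
    """Remove the hidden characters, recovering the visible text."""
    return text.replace(ZWSP, "").replace(ZWNJ, "")

visible = "This looks like a normal sentence to a human reader."
stego = inject_phantom(visible, "hi")
```

A defense-minded counterpart is equally simple: normalize documents by stripping zero-width code points before they reach the model.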
UniC-RAG: Universal Knowledge Corruption Attacks to Retrieval-Augmented Generation
Geng, Runpeng, Wang, Yanting, Chen, Ying, Jia, Jinyuan
Retrieval-augmented generation (RAG) systems are widely deployed in real-world applications in diverse domains such as finance, healthcare, and cybersecurity. However, many studies have shown that they are vulnerable to knowledge corruption attacks, where an attacker can inject adversarial texts into the knowledge database of a RAG system to induce the LLM to generate attacker-desired outputs. Existing studies mainly focus on attacking specific queries or queries with similar topics (or keywords). In this work, we propose UniC-RAG, a universal knowledge corruption attack against RAG systems. Unlike prior work, UniC-RAG jointly optimizes a small number of adversarial texts that can simultaneously attack a large number of user queries with diverse topics and domains, enabling an attacker to achieve various malicious objectives, such as directing users to malicious websites, triggering harmful command execution, or launching denial-of-service attacks. We formulate UniC-RAG as an optimization problem and further design an effective solution to solve it, including a balanced similarity-based clustering method to enhance the attack's effectiveness. Our extensive evaluations demonstrate that UniC-RAG is highly effective and significantly outperforms baselines. For instance, UniC-RAG could achieve over 90% attack success rate by injecting 100 adversarial texts into a knowledge database with millions of texts to simultaneously attack a large set of user queries (e.g., 2,000). Additionally, we evaluate existing defenses and show that they are insufficient to defend against UniC-RAG, highlighting the need for new defense mechanisms in RAG systems.
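The clustering step described above can be sketched in numpy. The balancing rule used here (greedy, capacity-limited assignment to nearest centers) is an assumption for illustration, not necessarily the paper's exact algorithm; the idea is to group query embeddings into roughly equal-sized clusters so one adversarial text can be optimized per cluster.

```python
# Hypothetical sketch of balanced similarity-based clustering: assign
# unit-norm query embeddings to k clusters, capping each cluster's size
# so no single adversarial text must cover too many queries.
import numpy as np

def balanced_cluster(embeddings: np.ndarray, k: int) -> np.ndarray:
    n = len(embeddings)
    cap = -(-n // k)                        # ceil(n / k): max queries per cluster
    rng = np.random.default_rng(0)
    centers = embeddings[rng.choice(n, k, replace=False)]
    labels = np.full(n, -1)
    counts = np.zeros(k, dtype=int)
    sims = embeddings @ centers.T           # cosine similarity for unit vectors
    # assign the most-confident points first, respecting cluster capacity
    for i in np.argsort(-sims.max(axis=1)):
        for c in np.argsort(-sims[i]):
            if counts[c] < cap:
                labels[i] = c
                counts[c] += 1
                break
    return labels

queries = np.random.default_rng(1).normal(size=(20, 8))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
labels = balanced_cluster(queries, k=4)
```

The per-cluster optimization of the adversarial text itself is the expensive part and is omitted here.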
Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
Shu, Huizhen, Li, Xuying, Wang, Qirui, Kosuga, Yuji, Tian, Mengqiu, Li, Zhuo
With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders to identify and manipulate critical features in text. After using the SAE model to reconstruct hidden layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing their potential to evade existing defenses. Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. However, the method's effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.
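The feature-selection step can be sketched with a toy sparse autoencoder in numpy. The random weights, tied decoder, and amplification factor below are all assumptions for illustration, not the paper's setup; the point is the flow: encode hidden states into sparse features, rank features by activation over previously successful adversarial texts, and perturb only the top-ranked ones.

```python
# Toy sketch of the SFPF flow (assumed interfaces, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64
W_enc = rng.normal(0, 0.3, (d_model, d_sae))   # stand-in SAE encoder weights
W_dec = W_enc.T                                 # tied decoder (an assumption)

def sae_encode(h):
    return np.maximum(h @ W_enc, 0.0)           # ReLU yields a sparse feature vector

def sae_decode(f):
    return f @ W_dec

hidden = rng.normal(size=(10, d_model))         # hidden states of "successful" texts
feats = sae_encode(hidden)
top = np.argsort(-feats.mean(axis=0))[:5]       # most-activated features overall

def perturb(h, scale=0.5):
    f = sae_encode(h)
    f[top] *= (1.0 + scale)                     # amplify the critical features
    return sae_decode(f)                        # perturbed hidden representation

new_h = perturb(hidden[0])
```

In the real framework the perturbed representation would then be mapped back to text; that generation step is omitted here.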
Defending Against Knowledge Poisoning Attacks During Retrieval-Augmented Generation
Edemacu, Kennedy, Shashidhar, Vinay M., Tuape, Micheal, Abudu, Dan, Jang, Beakcheol, Kim, Jong Wook
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, where attackers can compromise the knowledge source to mislead the generation model. One such attack is PoisonedRAG, in which the injected adversarial texts steer the model to generate an attacker-chosen response for a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we identify a new property that differentiates adversarial from clean texts in the knowledge data source. Next, we employ this property to filter out adversarial texts in the design of our proposed approaches. Evaluation of these methods using benchmark datasets demonstrates their effectiveness, with performances close to those of the original RAG systems. A key challenge associated with large language models (LLMs) [1]-[3] is their tendency to become outdated and struggle to integrate the most recent knowledge [4], [5]. This fundamental shortcoming is addressed by the recent emergence of retrieval-augmented generation (RAG) [6]-[9].
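One plausible instantiation of such a filtering property (an assumption on our part, not necessarily the property the paper proposes) is to treat anomalously high retrieval scores as a red flag, since injected texts are optimized to rank at the top of retrieval results:

```python
# Hypothetical retrieval-time filter: drop passages whose query-similarity
# score is an extreme outlier relative to the rest of the retrieved set.
# The z-score threshold is an assumed parameter for illustration.
import numpy as np

def filter_retrieved(scores, texts, z_max=1.5):
    scores = np.asarray(scores, dtype=float)
    mu, sigma = scores.mean(), scores.std() + 1e-9
    keep = (scores - mu) / sigma <= z_max       # flag only high-side outliers
    return [t for t, k in zip(texts, keep) if k]

texts = ["clean A", "clean B", "clean C", "clean D", "poisoned"]
scores = [0.52, 0.49, 0.55, 0.50, 0.97]         # optimized text scores far above the rest
clean = filter_retrieved(scores, texts)
```

A real defense would combine several such signals; a single score threshold is easy for an adaptive attacker to evade.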
Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks
Zhang, Xiaomei, Zhang, Zhaoxi, Zhang, Yanjun, Zheng, Xufei, Zhang, Leo Yu, Hu, Shengshan, Pan, Shirui
Textual adversarial examples pose serious threats to the reliability of natural language processing systems. Recent studies suggest that adversarial examples tend to deviate from the underlying manifold of normal texts, whereas pre-trained masked language models can approximate the manifold of normal data. These findings inspire the exploration of masked language models for detecting textual adversarial attacks. We first introduce Masked Language Model-based Detection (MLMD), leveraging the mask and unmask operations of the masked language modeling (MLM) objective to induce the difference in manifold changes between normal and adversarial texts. Although MLMD achieves competitive detection performance, its exhaustive one-by-one masking strategy introduces significant computational overhead. Our posterior analysis reveals that a significant number of non-keywords in the input are not important for detection but consume resources. Building on this, we introduce Gradient-guided MLMD (GradMLMD), which leverages gradient information to identify and skip non-keywords during detection, significantly reducing resource consumption without compromising detection performance. Extensive experiments show that GradMLMD maintains comparable or better performance than MLMD and outperforms existing detectors. Among defenses based on the off-manifold conjecture, GradMLMD presents a novel method for capturing manifold changes and provides a practical solution for real-world application challenges.
Index Terms: NLP, adversarial attack, adversarial defense, masked language model.
Although advanced deep neural networks have the potential to revolutionize the performance of myriad natural language processing (NLP) tasks [1-3], they are highly vulnerable to adversarial attacks [4-7]. Through carefully manipulated inputs, attackers can drive models to produce erroneous outputs to their advantage.
Many researchers have focused on introducing adversarial perturbations into the input by altering entire sentences. However, predominant efforts have been made to develop attacks at the word level and character level [8-14]. Correspondence to Dr. L. Zhang and Prof. X. Zheng. Xiaomei Zhang, Leo Yu Zhang, and Shirui Pan are with the School of Information and Communication Technology, Griffith University, Queensland, Australia (e-mail: xiaomei.zhang@griffithuni.edu.au). Zhaoxi Zhang and Yanjun Zhang are with the School of Computer Science, University of Technology Sydney, Sydney, New South Wales, Australia (e-mail: Zhaoxi.Zhang-1@student.uts.edu.au). Xufei Zheng is with the College of Computer and Information Science, Southwest University, Chongqing, China (e-mail: zxufei@swu.edu.cn).
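The gradient-guided skipping idea can be sketched with a toy mean-pooled linear classifier, where gradient-times-input saliency selects keyword positions and the expensive mask-and-check runs only there. The classifier, embeddings, and the deletion-style "mask" below are stand-ins, not GradMLMD's actual components.

```python
# Toy sketch of gradient-guided keyword selection for detection: compute a
# saliency score per word, then probe only the top-scoring positions.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "movie": 1, "was": 2, "terrible": 3, "acting": 4}
E = rng.normal(size=(len(vocab), 8))    # toy word embeddings
w = rng.normal(size=8)                  # toy linear classifier weights

def predict(ids):
    x = E[ids].mean(axis=0)             # mean-pooled sentence vector
    return 1 / (1 + np.exp(-x @ w))     # sigmoid probability

def saliency(ids):
    # gradient-times-input: for a mean-pooled linear model, the gradient of
    # the logit wrt each word embedding is w/n, so |e_i . w| / n per word
    return np.abs(E[ids] @ w) / len(ids)

ids = [vocab[t] for t in "the movie was terrible".split()]
keywords = np.argsort(-saliency(ids))[: len(ids) // 2]   # skip low-gradient words

# run the expensive mask-and-check only at keyword positions
shifts = []
for i in keywords:
    masked = [tok for pos, tok in enumerate(ids) if pos != i]  # crude "mask"
    shifts.append(abs(predict(ids) - predict(masked)))
detection_score = max(shifts)           # large shift suggests an off-manifold input
```

A real implementation would use a masked language model to refill the masked position rather than deleting it, and a trained victim model in place of the random linear classifier.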
TSCheater: Generating High-Quality Tibetan Adversarial Texts via Visual Similarity
Cao, Xi, Gesang, Quzong, Sun, Yuan, Qun, Nuo, Nyima, Tashi
Language models based on deep neural networks are vulnerable to textual adversarial attacks. While rich-resource languages like English receive focused attention, Tibetan, a cross-border language, is gradually being studied due to its abundant ancient literature and critical language strategy. Currently, there are several Tibetan adversarial text generation methods, but they do not fully consider the textual features of Tibetan script and overestimate the quality of generated adversarial texts. To address this issue, we propose a novel Tibetan adversarial text generation method called TSCheater, which considers the characteristics of Tibetan encoding and the fact that visually similar syllables have similar semantics. This method can also be transferred to other abugidas, such as Devanagari script. We utilize a self-constructed Tibetan syllable visual similarity database called TSVSDB to generate substitution candidates and adopt a greedy algorithm-based scoring mechanism to determine substitution order. We then evaluate the method on eight victim language models. Experimentally, TSCheater outperforms existing methods in attack effectiveness, perturbation magnitude, semantic similarity, visual similarity, and human acceptance. Finally, we construct the first Tibetan adversarial robustness evaluation benchmark, called AdvTS, which is generated by existing methods and proofread by humans.
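The greedy substitution loop is language-agnostic and can be sketched as follows. Toy Latin lookalikes stand in for the Tibetan visual-similarity database TSVSDB, and the victim model's score is a stub; both are assumptions for illustration only.

```python
# Sketch of greedy visual-similarity substitution: rank positions by their
# (stubbed) impact on the victim model, then replace the highest-impact
# units with visually similar candidates until the perturbation budget runs out.
LOOKALIKE = {"o": "0", "l": "1", "e": "3", "a": "@"}   # hypothetical lookalike table

def impact(text, i):
    # stub for the victim model's score drop when unit i is perturbed;
    # here we simply assume earlier units matter more (toy heuristic)
    return len(text) - i

def greedy_attack(text, budget=2):
    order = sorted(range(len(text)), key=lambda i: -impact(text, i))
    chars, used = list(text), 0
    for i in order:
        if used >= budget:
            break
        sub = LOOKALIKE.get(chars[i])
        if sub:
            chars[i] = sub
            used += 1
    return "".join(chars)

adv = greedy_attack("hello")   # substitutes the two highest-impact replaceable units
```

In the real method, `impact` would query the victim model and `LOOKALIKE` would map Tibetan syllables to visually similar syllables with similar semantics.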
Human-in-the-Loop Generation of Adversarial Texts: A Case Study on Tibetan Script
Cao, Xi, Sun, Yuan, Li, Jiajun, Gesang, Quzong, Qun, Nuo, Nyima, Tashi
DNN-based language models perform excellently on various tasks, but even SOTA LLMs are susceptible to textual adversarial attacks. Adversarial texts play crucial roles in multiple subfields of NLP. However, current research has the following issues. (1) Most textual adversarial attack methods target rich-resourced languages. How do we generate adversarial texts for less-studied languages? (2) Most textual adversarial attack methods are prone to generating invalid or ambiguous adversarial texts. How do we construct high-quality adversarial robustness benchmarks? (3) New language models may be immune to part of previously generated adversarial texts. How do we update adversarial robustness benchmarks? To address the above issues, we introduce HITL-GAT, a system based on a general approach to human-in-the-loop generation of adversarial texts. HITL-GAT contains four stages in one pipeline: victim model construction, adversarial example generation, high-quality benchmark construction, and adversarial robustness evaluation. Additionally, we utilize HITL-GAT to present a case study on Tibetan script, which can serve as a reference for the adversarial research of other less-studied languages.
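The four-stage pipeline can be written down as a hypothetical skeleton; every stage body below is a placeholder, and the human review step is modeled as a simple predicate rather than an actual annotation interface.

```python
# Hypothetical skeleton of the four-stage human-in-the-loop pipeline named
# in the abstract. Each function is a stub standing in for a real stage.
def construct_victim_models(corpus):
    return ["victim-model"]                    # stage 1: train/collect victims

def generate_adversarial(models, corpus):
    return [("orig", "adv")]                   # stage 2: attack to get (orig, adv) pairs

def human_review(pairs):
    # stage 3: humans keep only valid, unambiguous adversarial examples;
    # modeled here as a trivial filter
    return [p for p in pairs if p[1] != p[0]]

def evaluate_robustness(models, benchmark):
    return {m: len(benchmark) for m in models}  # stage 4: run the benchmark

corpus = ["example sentence"]
models = construct_victim_models(corpus)
pairs = generate_adversarial(models, corpus)
benchmark = human_review(pairs)
report = evaluate_robustness(models, benchmark)
```

The value of the design is the loop: as new models appear, stages 1-4 re-run and the benchmark is refreshed with examples the new models are not yet immune to.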
SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments
Cao, Yue, Xing, Yun, Zhang, Jie, Lin, Di, Zhang, Tianwei, Tsang, Ivor, Liu, Yang, Guo, Qing
Large vision-language models (LVLMs) have shown remarkable capabilities in interpreting visual content. While existing works demonstrate these models' vulnerability to deliberately placed adversarial texts, such texts are often easily identifiable as anomalous. In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness, by leveraging the capabilities of an LLM-based agent. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. We propose a training-free, multi-modal LLM-driven scene-coherent typographic adversarial planner (SceneTAP) that employs a three-stage process: scene understanding, adversarial planning, and seamless integration. SceneTAP utilizes chain-of-thought reasoning to comprehend the scene, formulate effective adversarial text, strategically plan its placement, and provide detailed instructions for natural integration within the image. This is followed by a scene-coherent TextDiffuser that executes the attack using a local diffusion mechanism. We extend our method to real-world scenarios by printing and placing generated patches in physical environments, demonstrating its practical implications. Extensive experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs, including ChatGPT-4o, even after capturing new images of physical setups. Our evaluations demonstrate a significant increase in attack success rates while maintaining visual naturalness and contextual appropriateness. This work highlights vulnerabilities in current vision-language models to sophisticated, scene-coherent adversarial attacks and provides insights into potential defense mechanisms.
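The three-stage data flow can be sketched as a hypothetical pipeline; the LLM agent and the diffusion renderer are stubbed out, and the example text, placement, and style are invented for illustration.

```python
# Hypothetical data-flow sketch of the three stages named in the abstract:
# scene understanding -> adversarial planning -> seamless integration.
from dataclasses import dataclass

@dataclass
class Plan:
    text: str          # what adversarial text to render
    location: str      # where to place it in the scene
    style: str         # how to blend it in naturally

def understand_scene(image_desc: str) -> dict:
    # stub for the LLM agent's scene understanding
    return {"objects": image_desc.split(", ")}

def plan_attack(scene: dict) -> Plan:
    # stub for chain-of-thought adversarial planning
    target = scene["objects"][0]
    return Plan(text="WET PAINT", location=f"on the {target}", style="stencil")

def integrate(plan: Plan) -> str:
    # stub for the scene-coherent TextDiffuser rendering step
    return f"render '{plan.text}' {plan.location} in {plan.style} style"

instruction = integrate(plan_attack(understand_scene("bench, tree, sign")))
```

In the real system each stub is a model call, and the final instruction drives a local diffusion edit of the image rather than returning a string.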