Xiang, Chong
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
Wu, Tong, Zhang, Shujian, Song, Kaiqiang, Xu, Silei, Zhao, Sanqiang, Agrawal, Ravi, Indurthi, Sathish Reddy, Xiang, Chong, Mittal, Prateek, Zhou, Wenxuan
Large Language Models (LLMs) are susceptible to security and safety threats, such as prompt injection, prompt extraction, and harmful requests. One major cause of these vulnerabilities is the lack of an instruction hierarchy. Modern LLM architectures treat all inputs equally, failing to distinguish between and prioritize various types of instructions, such as system messages, user prompts, and data. As a result, lower-priority user prompts may override more critical system instructions, including safety protocols. Existing approaches to achieving instruction hierarchy, such as delimiters and instruction-based training, do not address this issue at the architectural level. We introduce the Instructional Segment Embedding (ISE) technique, inspired by BERT, which embeds instruction-priority information directly into modern large language models. This approach enables models to explicitly differentiate and prioritize various instruction types, significantly improving safety against malicious prompts that attempt to override priority rules. Our experiments on the Structured Query and Instruction Hierarchy benchmarks demonstrate average robust accuracy increases of up to 15.75% and 18.68%, respectively. Furthermore, we observe an improvement in instruction-following capability of up to 4.1%, evaluated on AlpacaEval. Overall, our approach offers a promising direction for enhancing the safety and effectiveness of LLM architectures.

Large Language Models (LLMs) have shown significant potential in enabling sophisticated agentic applications and facilitating autonomous decision-making across various domains, such as web agents, educational tools, and medical assistance (Yao et al., 2022; Gan et al., 2023; Abbasian et al., 2024). To optimize the use of such AI applications, a structured approach to implementation is widely adopted, with clear distinctions among system instructions, user prompts, and data inputs, as illustrated in Figure 1. These instructions carry specific priorities that help the model execute its functionality correctly and better assist users. Modern LLMs, however, process text without formal mechanisms to differentiate and prioritize instructions. Consequently, malicious attackers can easily exploit this limitation to override priority rules, leading to various vulnerabilities. Prompt extraction attacks (Zhang et al., 2024), for example, aim to extract system messages, revealing proprietary prompts.

[Figure 1: A demonstration of the hierarchy of instructions, including system messages, user prompts, and data.]
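To make the architectural idea concrete, here is a minimal sketch of how an instruction-type segment embedding could be added to a decoder-only transformer's input representation, in the spirit of BERT's segment embeddings. The class name, dimensions, and the three-way system/user/data split are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of Instructional Segment Embedding (ISE), assuming a
# decoder-only transformer; names, dimensions, and the segment IDs are
# illustrative assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn

class ISEEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden_dim: int, num_segments: int = 3):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_dim)
        # One learned vector per instruction type, e.g. 0=system, 1=user, 2=data.
        self.segment_embedding = nn.Embedding(num_segments, hidden_dim)

    def forward(self, token_ids: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
        # Both inputs have shape (batch, seq_len); segment_ids marks each
        # token's place in the instruction hierarchy.
        return self.token_embedding(token_ids) + self.segment_embedding(segment_ids)

# Toy usage: a prompt whose first tokens are system instructions,
# followed by user tokens and then data tokens.
embed = ISEEmbedding(vocab_size=32000, hidden_dim=4096)
token_ids = torch.randint(0, 32000, (1, 8))
segment_ids = torch.tensor([[0, 0, 0, 1, 1, 2, 2, 2]])
hidden = embed(token_ids, segment_ids)  # (1, 8, 4096), fed to the transformer blocks
```

Because the priority information is injected at the embedding layer, every subsequent attention layer can condition on it, rather than relying on delimiters surviving in the token stream.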
Certifiably Robust RAG against Retrieval Corruption
Xiang, Chong, Wu, Tong, Zhong, Zexuan, Wagner, David, Chen, Danqi, Mittal, Prateek
Retrieval-augmented generation (RAG) has been shown to be vulnerable to retrieval corruption attacks: an attacker can inject malicious passages into retrieval results to induce inaccurate responses. In this paper, we propose RobustRAG as the first defense framework against retrieval corruption attacks. The key insight of RobustRAG is an isolate-then-aggregate strategy: we obtain LLM responses from each passage in isolation and then securely aggregate these isolated responses. To instantiate RobustRAG, we design keyword-based and decoding-based algorithms for securely aggregating unstructured text responses. Notably, RobustRAG can achieve certifiable robustness: we can formally prove and certify that, for certain queries, RobustRAG will always return accurate responses, even when the attacker has full knowledge of our defense and can arbitrarily inject a small number of malicious passages. We evaluate RobustRAG on open-domain QA and long-form text generation datasets and demonstrate its effectiveness and generalizability across various tasks and datasets.
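Below is a hedged sketch of the isolate-then-aggregate strategy with a simple keyword-vote aggregation. The `generate` callable stands in for any LLM API, and the word-level keyword extraction and `min_votes` threshold are simplified assumptions rather than the paper's exact algorithm.

```python
# Hedged sketch of RobustRAG's isolate-then-aggregate idea with a
# keyword-style aggregation; `generate` is a stand-in for any LLM call.
from collections import Counter
from typing import Callable, List

def robust_rag_keyword(query: str,
                       passages: List[str],
                       generate: Callable[[str], str],
                       min_votes: int = 3) -> str:
    # 1) Isolation: answer the query from each passage independently, so one
    #    corrupted passage cannot influence the other isolated responses.
    isolated = [generate(f"Context: {p}\nQuestion: {query}\nAnswer:") for p in passages]

    # 2) Keyword voting: count how many isolated responses mention each word
    #    (a real system would extract salient keywords or phrases instead).
    votes = Counter()
    for resp in isolated:
        votes.update(set(resp.lower().split()))
    kept = [w for w, c in votes.items() if c >= min_votes]

    # 3) Secure aggregation: produce the final answer only from keywords that
    #    enough independent passages agree on.
    return generate(f"Question: {query}\nRelevant keywords: {', '.join(kept)}\nAnswer:")
```

Because each isolated response is produced from a single passage, a small number of injected passages can only contribute a small number of votes, which is what makes certification over the aggregated keywords possible.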
Position Paper: Beyond Robustness Against Single Attack Types
Dai, Sihui, Xiang, Chong, Wu, Tong, Mittal, Prateek
Current research on defending against adversarial examples focuses primarily on achieving robustness against a single attack type, such as $\ell_2$- or $\ell_{\infty}$-bounded attacks. However, the space of possible perturbations is much larger and currently cannot be modeled by a single attack type. The discrepancy between the focus of current defenses and the space of attacks of interest calls into question the practicality of existing defenses and the reliability of their evaluation. In this position paper, we argue that the research community should look beyond single-attack robustness, and we draw attention to three potential directions involving robustness against multiple attacks: simultaneous multiattack robustness, unforeseen attack robustness, and a newly defined problem setting which we call continual adaptive robustness. We provide a unified framework that rigorously defines these problem settings, synthesize existing research in these fields, and outline open directions. We hope that our position paper inspires more research in simultaneous multiattack, unforeseen attack, and continual adaptive robustness.
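As a rough illustration of the kind of unified formulation the paper advocates, simultaneous multiattack robustness can be written as a worst-case objective over a union of perturbation sets; the notation below is a sketch of this standard formulation, not necessarily the paper's exact definition.

```latex
% Sketch: worst-case risk over K attack types with perturbation sets S_1, ..., S_K
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
\Big[ \max_{1 \le k \le K} \; \max_{\delta \in S_k(x)}
\ell\big(f_\theta(x + \delta),\, y\big) \Big]
```

Loosely, unforeseen attack robustness then corresponds to the test-time sets $S_k$ being unknown when training, and continual adaptive robustness to the attacker's capabilities changing over time.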
MultiRobustBench: Benchmarking Robustness Against Multiple Attacks
Dai, Sihui, Mahloujifar, Saeed, Xiang, Chong, Sehwag, Vikash, Chen, Pin-Yu, Mittal, Prateek
The bulk of existing research on defending against adversarial examples focuses on defending against a single (typically $\ell_p$-bounded) attack, but in practical settings, machine learning (ML) models should be robust to a wide variety of attacks. In this paper, we present the first unified framework for considering multiple attacks against ML models. Our framework can model different levels of the learner's knowledge about the test-time adversary, allowing us to model robustness against unforeseen attacks and robustness against unions of attacks. Using our framework, we present the first leaderboard, MultiRobustBench, for benchmarking multiattack evaluation, which captures performance across attack types and attack strengths. We evaluate the performance of 16 defended models for robustness against a set of 9 different attack types, including $\ell_p$-based threat models, spatial transformations, and color changes, at 20 different attack strengths (180 attacks in total). Additionally, we analyze the state of current defenses against multiple attacks. Our analysis shows that while existing defenses have made progress in terms of average robustness across the set of attacks used, robustness against the worst-case attack remains a major open problem, as all existing models perform worse than random guessing.
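To illustrate the two aggregate views such a leaderboard reports, here is a minimal sketch that summarizes one model's per-attack, per-strength robust accuracies into an average and a worst case; the data layout and function name are illustrative assumptions, not MultiRobustBench's code.

```python
# Hedged sketch of the aggregation a multiattack leaderboard needs: given
# per-(attack, strength) robust accuracies for one model, report both the
# average and the worst case across all attacks.
import numpy as np

def summarize_multiattack(robust_acc: np.ndarray) -> dict:
    # robust_acc has shape (num_attack_types, num_strengths), e.g. (9, 20),
    # where each entry is robust accuracy against that attack at that strength.
    return {
        "average_robust_acc": float(robust_acc.mean()),
        # Worst case over every attack type and strength: the quantity the
        # analysis above finds is still below random guessing for all models.
        "worst_case_robust_acc": float(robust_acc.min()),
    }

# Toy usage with random numbers in place of measured accuracies.
scores = np.random.uniform(0.0, 0.6, size=(9, 20))
print(summarize_multiattack(scores))
```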
PatchGuard: A Provably Robust Defense against Adversarial Patches via Small Receptive Fields and Masking
Xiang, Chong, Bhagoji, Arjun Nitin, Sehwag, Vikash, Mittal, Prateek
Localized adversarial patches aim to induce misclassification in machine learning models by arbitrarily modifying pixels within a restricted region of an image. Such attacks can be realized in the physical world by attaching the adversarial patch to the object to be misclassified, and defending against them remains an open problem. In this paper, we propose a general defense framework called PatchGuard that achieves high provable robustness while maintaining high clean accuracy against localized adversarial patches. The cornerstone of PatchGuard is the use of CNNs with small receptive fields, which bounds the number of features that an adversarial patch can corrupt. Given a bounded number of corrupted features, the problem of designing an adversarial patch defense reduces to that of designing a secure feature aggregation mechanism. Towards this end, we present our robust masking defense, which robustly detects and masks corrupted features to recover the correct prediction. Our extensive evaluation on the ImageNet, ImageNette (a 10-class subset of ImageNet), and CIFAR-10 datasets demonstrates that our defense achieves state-of-the-art performance in terms of both provable robust accuracy and clean accuracy.
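The following is a simplified sketch of the robust-masking intuition: because a small receptive field bounds how many feature cells a patch can corrupt, an unusually concentrated window of class evidence can be detected and masked before aggregation. The window size, threshold, and single-class formulation are illustrative assumptions, not PatchGuard's exact algorithm.

```python
# Simplified sketch of robust masking over a class-evidence map produced by a
# small-receptive-field CNN; window size and threshold are illustrative.
import numpy as np

def robust_masking_score(evidence: np.ndarray, window: int = 6, gamma: float = 0.6) -> float:
    # evidence: (H, W) non-negative class-evidence map for a single class.
    H, W = evidence.shape
    best_sum, best_ij = 0.0, None
    # Find the window that concentrates the most evidence; a patch can only
    # corrupt cells inside one such window thanks to the small receptive field.
    for i in range(H - window + 1):
        for j in range(W - window + 1):
            s = float(evidence[i:i + window, j:j + window].sum())
            if s > best_sum:
                best_sum, best_ij = s, (i, j)
    total = float(evidence.sum())
    # If one window holds a suspiciously large share of the evidence, mask it out.
    if best_ij is not None and total > 0 and best_sum / total > gamma:
        i, j = best_ij
        evidence = evidence.copy()
        evidence[i:i + window, j:j + window] = 0.0
    return float(evidence.sum())  # aggregate the remaining evidence for this class
```

Running this per class and predicting the highest-scoring class mimics the secure aggregation step; the actual defense additionally reasons about all possible patch locations to certify robustness.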