defense strategy
Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack
Deep neural networks face persistent challenges in defending against backdoor attacks, leading to an ongoing battle between attacks and defenses. While existing backdoor defense strategies have shown promising performance in reducing attack success rates, can we confidently claim that the backdoor threat has truly been eliminated from the model? To address this question, we re-investigate the characteristics of backdoored models after defense (denoted as defense models). Surprisingly, we find that the original backdoors still exist in defense models derived from existing post-training defense strategies, where backdoor existence is measured by a novel metric called the backdoor existence coefficient. This implies that the backdoors merely lie dormant rather than being eliminated. To further verify this finding, we empirically show that these dormant backdoors can be easily re-activated during the inference stage by manipulating the original trigger with a well-designed tiny perturbation computed via a universal adversarial attack. More practically, we extend backdoor re-activation to the black-box scenario, where the adversary can only query the defense model during the inference stage, and develop two effective methods, i.e., query-based and transfer-based backdoor re-activation attacks. The effectiveness of the proposed methods is verified on both image classification and multimodal contrastive learning (i.e., CLIP) tasks. In conclusion, this work uncovers a critical vulnerability that has never been explored by existing defense strategies, emphasizing the urgency of designing more robust and advanced backdoor defense mechanisms in the future.
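To make the white-box variant of this attack concrete, here is a minimal sketch of learning a universal perturbation that re-activates a dormant backdoor. Everything here is an illustrative assumption, not the paper's code: `model` is the defense model, `apply_trigger` stamps the original trigger, `target_class` is the attacker's label, and the CIFAR-like input shape is arbitrary.

```python
# Hypothetical sketch: optimize a tiny universal perturbation delta so that
# trigger-stamped inputs are again classified as the attacker's target class.
import torch
import torch.nn.functional as F

def reactivation_perturbation(model, loader, apply_trigger, target_class,
                              eps=4 / 255, steps=100, lr=0.01, device="cpu"):
    model.eval()
    # one perturbation shared by all inputs (CIFAR-like shape assumed)
    delta = torch.zeros(1, 3, 32, 32, device=device, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        for x, _ in loader:
            x = apply_trigger(x.to(device))          # stamp the original trigger
            logits = model((x + delta).clamp(0, 1))  # perturbed triggered input
            target = torch.full((x.size(0),), target_class, device=device)
            loss = F.cross_entropy(logits, target)   # push toward target class
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():                    # keep the perturbation tiny
                delta.clamp_(-eps, eps)
    return delta.detach()
```

The query-based and transfer-based black-box variants described above would replace the gradient step with score-based estimates or a surrogate model, respectively.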
MoGU: A Framework for Enhancing Safety of LLMs While Preserving Their Usability
Large Language Models (LLMs) are increasingly deployed in various applications. As their usage grows, concerns regarding their safety are rising, especially their ability to maintain harmless responses when faced with malicious instructions. Many defense strategies have been developed to enhance the safety of LLMs. However, our research finds that existing defense strategies lead LLMs to adopt a predominantly rejection-oriented stance, thereby diminishing the usability of their responses to benign instructions. To solve this problem, we introduce the MoGU framework, designed to enhance LLMs' safety while preserving their usability. MoGU transforms the base LLM into two variants, a usable LLM and a safe LLM, and employs dynamic routing to balance their contributions. When encountering malicious instructions, the router assigns a higher weight to the safe LLM to ensure that responses are harmless.
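A minimal sketch of the gating idea is below, under the assumption that the router mixes the two variants at the logit level; the abstract does not specify the routing granularity or how the router is trained, and the HF-style `.logits` access is also an assumption.

```python
# Illustrative gate between a "usable" and a "safe" LLM variant; not MoGU's code.
import torch
import torch.nn as nn

class GatedSafetyRouter(nn.Module):
    def __init__(self, usable_llm, safe_llm, hidden_size):
        super().__init__()
        self.usable_llm = usable_llm           # variant tuned for helpfulness
        self.safe_llm = safe_llm               # variant tuned for harmlessness
        self.gate = nn.Linear(hidden_size, 1)  # scores how risky the prompt looks

    def forward(self, input_ids, prompt_embedding):
        # w -> 1 for malicious-looking prompts, so the safe variant dominates;
        # w -> 0 for benign prompts, preserving usability
        w = torch.sigmoid(self.gate(prompt_embedding))
        usable_logits = self.usable_llm(input_ids).logits
        safe_logits = self.safe_llm(input_ids).logits
        return w * safe_logits + (1 - w) * usable_logits
```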
Reimagining Mutual Information for Enhanced Defense against Data Leakage in Collaborative Inference
Edge-cloud collaborative inference empowers resource-limited IoT devices to support deep learning applications without disclosing their raw data to the cloud server, thus protecting users' data. Nevertheless, prior research has shown that collaborative inference still exposes the inputs and predictions of edge devices. To defend against such data leakage, we introduce InfoScissors, a defense strategy designed to reduce the mutual information between a model's intermediate outputs and the device's inputs and predictions. We evaluate our defense on several datasets against diverse attacks. Beyond the empirical comparison, we provide a theoretical analysis of the inadequacies of recent defense strategies that also utilize mutual information, particularly those based on the Variational Information Bottleneck (VIB) approach, and use it to illustrate the superiority of our method.
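The general training pattern the abstract describes can be sketched as follows. Here the mutual information between the intermediate features and the input is proxied adversarially by how well a decoder can reconstruct the input from them; InfoScissors' actual estimator may differ, and all names (`edge`, `cloud`, `decoder`) are illustrative.

```python
# Generic MI-reduction sketch for collaborative inference; not InfoScissors itself.
import torch.nn.functional as F

def training_step(edge, cloud, decoder, opt_model, opt_dec, x, y, lam=0.5):
    # z is the intermediate representation the device would send to the cloud
    z = edge(x)
    # Step 1: the decoder (a stand-in MI estimator) learns to invert z -> x
    opt_dec.zero_grad()
    F.mse_loss(decoder(z.detach()), x).backward()
    opt_dec.step()
    # Step 2: edge + cloud minimize the task loss while making z hard to
    # invert, i.e. pushing down the (proxied) mutual information MI(z, x)
    opt_model.zero_grad()
    task_loss = F.cross_entropy(cloud(z), y)
    leak_loss = -F.mse_loss(decoder(z), x)   # reward poor reconstruction
    (task_loss + lam * leak_loss).backward()
    opt_model.step()
    return task_loss.item()
```

A symmetric penalty on MI between the features and the predictions would follow the same adversarial pattern with a label-inference head in place of the decoder.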
Trap and Replace: Defending Backdoor Attacks by Trapping Them into an Easy-to-Replace Subnetwork
Deep neural networks (DNNs) are vulnerable to backdoor attacks. Previous works have shown that it is extremely challenging to unlearn the undesired backdoor behavior from the network, since the entire network can be affected by the backdoor samples. In this paper, we propose a brand-new backdoor defense strategy, Trap and Replace, which makes it much easier to remove the harmful influence of backdoor samples from the model. Our defense strategy consists of two stages.
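The abstract does not detail the two stages, so the following is only a structural sketch of one plausible reading: a small, disposable "bait" head absorbs the backdoor while a shared stem stays general-purpose (e.g., via an auxiliary reconstruction objective), after which the bait head is discarded and a fresh head is retrained on a small clean set. The module names and auxiliary loss are assumptions.

```python
# Hypothetical structure for a trap-and-replace style defense; not the paper's code.
import torch.nn as nn

class TrapAndReplaceNet(nn.Module):
    def __init__(self, stem, bait_head, recon_head):
        super().__init__()
        self.stem = stem        # shared backbone, kept after stage 1
        self.head = bait_head   # lightweight classifier meant to trap the backdoor
        self.recon = recon_head # auxiliary decoder keeping stem features generic

    def forward(self, x):
        z = self.stem(x)
        return self.head(z), self.recon(z)

def replace_head(net, fresh_head):
    # Stage 2: drop the trapped subnetwork and fine-tune a clean replacement
    # on a small set of trusted data.
    net.head = fresh_head
    for p in net.stem.parameters():
        p.requires_grad = False  # optionally freeze the (mostly clean) stem
    return net
```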
AlignTree: Efficient Defense Against LLM Jailbreak Attacks
Gil Goren, Shahar Katz, Lior Wolf
Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass safety guidelines and generate harmful content. Mitigating these vulnerabilities requires defense mechanisms that are both robust and computationally efficient. However, existing approaches either incur high computational costs or rely on lightweight defenses that can be easily circumvented, rendering them impractical for real-world LLM-based systems. In this work, we introduce the AlignTree defense, which enhances model alignment while maintaining minimal computational overhead. AlignTree monitors LLM activations during generation and detects misaligned behavior using an efficient random forest classifier. This classifier operates on two signals: (i) the refusal direction -- a linear representation that activates on misaligned prompts, and (ii) an SVM-based signal that captures non-linear features associated with harmful content. Unlike previous methods, AlignTree does not require additional prompts or auxiliary guard models. Through extensive experiments, we demonstrate the efficiency and robustness of AlignTree across multiple LLMs and benchmarks.
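A minimal sketch of the two-signal detector is below, using sklearn stand-ins. The refusal direction `r`, the activation extraction, and the training data are assumptions; only the overall recipe (linear refusal-direction score plus an SVM signal feeding a random forest) comes from the abstract.

```python
# Illustrative two-signal misalignment detector; not AlignTree's implementation.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def fit_detector(acts, labels, refusal_dir):
    # acts: (n, d) hidden activations; labels: 1 = misaligned, 0 = benign
    proj = acts @ refusal_dir                  # signal (i): refusal-direction score
    svm = SVC(kernel="rbf").fit(acts, labels)  # signal (ii): non-linear SVM
    svm_score = svm.decision_function(acts)
    feats = np.stack([proj, svm_score], axis=1)
    forest = RandomForestClassifier(n_estimators=100).fit(feats, labels)

    def detect(a):                             # a: (d,) activation during generation
        f = np.array([[a @ refusal_dir,
                       svm.decision_function(a[None])[0]]])
        return forest.predict(f)[0]            # 1 -> flag generation as misaligned
    return detect
```

Monitoring then reduces to calling `detect` on activations as tokens are generated, which keeps the added cost to two cheap classifier evaluations per check rather than an auxiliary guard model.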