
Collaborating Authors

 Wei, Boyi


On Evaluating the Durability of Safeguards for Open-Weight LLMs

arXiv.org Artificial Intelligence

Stakeholders -- from model developers to policymakers -- seek to minimize the dual-use risks of large language models (LLMs). An open challenge to this goal is whether technical safeguards can impede the misuse of LLMs, even when models are customizable via fine-tuning or when model weights are fully open. In response, several recent studies have proposed methods to produce durable LLM safeguards for open-weight LLMs that can withstand adversarial modifications of the model's weights via fine-tuning. This holds the promise of raising adversaries' costs even under strong threat models where adversaries can directly fine-tune model weights. However, in this paper, we urge more careful characterization of the limits of these approaches. Through several case studies, we demonstrate that even evaluating these defenses is exceedingly difficult and can easily mislead audiences into thinking that safeguards are more durable than they really are. We draw lessons from the evaluation pitfalls that we identify and suggest that future research carefully cabin claims to more constrained, well-defined, and rigorously examined threat models, which can provide more useful and candid assessments to stakeholders.
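
The fine-tuning threat model discussed above is concrete enough to sketch. Below is a minimal illustration, not any specific defense's evaluation protocol, of the kind of adversarial fine-tuning sweep a durability evaluation would need to withstand; the model name, the attack dataset, and the `harmfulness_score` evaluator are placeholders introduced here for illustration.

```python
# Sketch of an adversarial fine-tuning sweep against an open-weight model.
# Assumptions: a Hugging Face causal LM, a small list of attack texts, and a
# hypothetical harmfulness_score() evaluator supplied by the experimenter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def finetune_attack(model_name, attack_texts, lr=2e-5, epochs=1, device="cuda"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in attack_texts:
            batch = tok(text, return_tensors="pt", truncation=True).to(device)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model

# A candid evaluation sweeps attack settings rather than reporting a single one:
# for lr in (1e-5, 2e-5, 1e-4):
#     attacked = finetune_attack("safeguarded-model", attack_texts, lr=lr)
#     print(lr, harmfulness_score(attacked))  # harmfulness_score: hypothetical evaluator
```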


An Adversarial Perspective on Machine Unlearning for AI Safety

arXiv.org Artificial Intelligence

Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim to completely remove hazardous capabilities from models, making them inaccessible to adversaries. This work questions, from an adversarial perspective, whether unlearning is fundamentally different from traditional safety post-training. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
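
To make "removing specific directions in the activation space" concrete, here is a minimal PyTorch sketch of directional ablation via orthogonal projection on a decoder block's output. The layer index, the Hugging Face-style module path, and the precomputed direction file are assumptions for illustration, not the paper's exact procedure.

```python
# Ablate a fixed direction from a layer's residual-stream activations by
# projecting it out of the layer output at every position.
import torch

def project_out(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction` (orthogonal projection)."""
    d = direction / direction.norm()
    # hidden: (batch, seq_len, d_model); subtract (h . d) d from each position
    return hidden - (hidden @ d).unsqueeze(-1) * d

class DirectionAblationHook:
    """Forward hook that strips a fixed direction from a module's output."""
    def __init__(self, direction: torch.Tensor):
        self.direction = direction

    def __call__(self, module, inputs, output):
        if isinstance(output, tuple):  # decoder blocks often return tuples
            return (project_out(output[0], self.direction), *output[1:])
        return project_out(output, self.direction)

# Usage (assumes a Hugging Face-style decoder and a precomputed direction):
# direction = torch.load("unlearning_direction.pt")   # hypothetical artifact
# handle = model.model.layers[7].register_forward_hook(DirectionAblationHook(direction))
# ... run generation, then handle.remove()
```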


Evaluating Copyright Takedown Methods for Language Models

arXiv.org Artificial Intelligence

Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material. These models can memorize and generate content similar to their training data, posing potential concerns. Therefore, model creators are motivated to develop mitigation methods that prevent generating protected content. We term this procedure copyright takedowns for LMs, noting its conceptual similarity to (but legal distinction from) DMCA takedowns. This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs. We propose CoTaEval, an evaluation framework to assess the effectiveness of copyright takedown methods, their impact on the model's ability to retain uncopyrightable factual knowledge from the training data whose recitation is embargoed, and how well the model maintains its general utility and efficiency. We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches. Our findings indicate that no tested method excels across all metrics, showing significant room for research in this unique problem setting and indicating potential unresolved challenges for live policy proposals.
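
As one concrete example of what a decoding-time filtering intervention can look like, the sketch below masks any next token that would complete an n-gram from an embargoed corpus. It is an illustrative baseline with toy token ids, not the specific interventions evaluated by CoTaEval.

```python
# Decoding-time n-gram blocklist filter: at each generation step, forbid tokens
# that would reproduce an n-gram from the embargoed (e.g., copyrighted) corpus.
import torch

def banned_next_tokens(prefix_ids, blocked_ngrams, n=5):
    """Token ids that would complete a blocked n-gram given the current prefix."""
    if len(prefix_ids) < n - 1:
        return set()
    context = tuple(prefix_ids[-(n - 1):])
    return {ng[-1] for ng in blocked_ngrams if ng[:-1] == context}

def apply_ngram_filter(logits, prefix_ids, blocked_ngrams, n=5):
    """Mask out logits of tokens that would recite an embargoed n-gram."""
    for tok in banned_next_tokens(prefix_ids, blocked_ngrams, n):
        logits[tok] = float("-inf")
    return logits

# Toy example (a real setup would build `blocked_ngrams` by sliding an n-token
# window over the tokenized embargoed documents):
blocked = {(11, 42, 7, 99, 3)}
logits = torch.zeros(100)
filtered = apply_ngram_filter(logits, prefix_ids=[5, 11, 42, 7, 99],
                              blocked_ngrams=blocked, n=5)
assert filtered[3] == float("-inf")
```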


AI Risk Management Should Incorporate Both Safety and Security

arXiv.org Artificial Intelligence

The exposure of security vulnerabilities in safety-aligned language models, e.g., susceptibility to adversarial attacks, has shed light on the intricate interplay between AI safety and AI security. Although the two disciplines now come together under the overarching goal of AI risk management, they have historically evolved separately, giving rise to differing perspectives. Therefore, in this paper, we advocate that stakeholders in AI risk management should be aware of the nuances, synergies, and interplay between safety and security, and unambiguously take into account the perspectives of both disciplines in order to devise effective and holistic risk mitigation approaches. Unfortunately, this vision is often obfuscated, as the definitions of the basic concepts of "safety" and "security" themselves are often inconsistent and lack consensus across communities. With AI risk management being increasingly cross-disciplinary, this issue is particularly salient. In light of this conceptual challenge, we introduce a unified reference framework to clarify the differences and interplay between AI safety and AI security, aiming to facilitate a shared understanding and effective collaboration across communities.


Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

arXiv.org Artificial Intelligence

Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about 3% at the parameter level and 2.5% at the rank level.

Despite these efforts, recent studies have uncovered concerning 'jailbreak' scenarios. In these cases, even well-aligned models have had their safeguards successfully breached (Albert, 2023). These jailbreaks can include crafting adversarial prompts (Wei et al., 2023; Jones et al., 2023; Carlini et al., 2023; Zou et al., 2023b; Shen et al., 2023; Zhu et al., 2023; Qi et al., 2023), applying persuasion techniques (Zeng et al., 2024), or manipulating the model's decoding process (Huang et al., 2024). Recent studies show that finetuning an aligned LLM, even on a non-malicious dataset, can inadvertently weaken a model's safety mechanisms (Qi et al., 2024; Yang et al., 2023; Zhan et al., 2023). Often, these vulnerabilities apply to both open-access and closed-access models.
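
The parameter-level attribution idea can be illustrated with a short sketch: score each weight's importance on a safety dataset and on a utility dataset, then isolate weights that rank highly for safety but not for utility. The |weight × gradient| importance proxy and the top-k set difference below are simplifying assumptions for illustration, not necessarily the paper's exact attribution method; `model`, `loss_fn`, and the batches are placeholders.

```python
# Isolate weights that appear safety-critical but not utility-critical, then
# prune them to probe how brittle the safety guardrails are.
import torch

def importance_scores(model, loss_fn, batch):
    """First-order importance proxy |w * dL/dw| for every parameter on one batch."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    return {name: (p * p.grad).abs().detach()
            for name, p in model.named_parameters() if p.grad is not None}

def safety_critical_mask(safety_imp, utility_imp, top_frac=0.03):
    """Weights in the top `top_frac` for safety importance but not for utility."""
    masks = {}
    for name in safety_imp:
        s, u = safety_imp[name].flatten(), utility_imp[name].flatten()
        k = max(1, int(top_frac * s.numel()))
        top_s = torch.zeros_like(s, dtype=torch.bool)
        top_u = torch.zeros_like(u, dtype=torch.bool)
        top_s[s.topk(k).indices] = True
        top_u[u.topk(k).indices] = True
        masks[name] = (top_s & ~top_u).reshape(safety_imp[name].shape)
    return masks

# Zeroing the selected ~3% of weights and re-running a safety benchmark would
# then show whether removing this sparse region disables the guardrails:
# for name, p in model.named_parameters():
#     if name in masks:
#         p.data[masks[name]] = 0.0
```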