
Collaborating Authors

 Wei, Boyi


On Evaluating the Durability of Safeguards for Open-Weight LLMs

arXiv.org Artificial Intelligence

Stakeholders -- from model developers to policymakers -- seek to minimize the dual-use risks of large language models (LLMs). An open challenge to this goal is whether technical safeguards can impede the misuse of LLMs, even when models are customizable via fine-tuning or when model weights are fully open. In response, several recent studies have proposed methods to produce durable LLM safeguards for open-weight LLMs that can withstand adversarial modifications of the model's weights via fine-tuning. This holds the promise of raising adversaries' costs even under strong threat models where adversaries can directly fine-tune model weights. However, in this paper, we urge more careful characterization of the limits of these approaches. Through several case studies, we demonstrate that even evaluating these defenses is exceedingly difficult and can easily mislead audiences into thinking that safeguards are more durable than they really are. We draw lessons from the evaluation pitfalls that we identify and suggest that future research carefully cabin claims to more constrained, well-defined, and rigorously examined threat models, which can provide more useful and candid assessments to stakeholders.
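
The fine-tuning threat model discussed above is concrete enough to sketch. Below is a minimal illustration, not any specific defense's evaluation protocol, of the kind of adversarial fine-tuning sweep a durability evaluation would need to withstand; the model name, the attack dataset, and the `harmfulness_score` evaluator are placeholders introduced here for illustration.

```python
# Sketch of an adversarial fine-tuning sweep against an open-weight model.
# Assumptions: a Hugging Face causal LM, a small list of attack texts, and a
# hypothetical harmfulness_score() evaluator supplied by the experimenter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def finetune_attack(model_name, attack_texts, lr=2e-5, epochs=1, device="cuda"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in attack_texts:
            batch = tok(text, return_tensors="pt", truncation=True).to(device)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model

# A candid evaluation sweeps attack settings rather than reporting a single one:
# for lr in (1e-5, 2e-5, 1e-4):
#     attacked = finetune_attack("safeguarded-model", attack_texts, lr=lr)
#     print(lr, harmfulness_score(attacked))  # harmfulness_score: hypothetical evaluator
```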


An Adversarial Perspective on Machine Unlearning for AI Safety

arXiv.org Artificial Intelligence

Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim to completely remove hazardous capabilities from models, making them inaccessible to adversaries. This work questions, from an adversarial perspective, whether unlearning is fundamentally different from traditional safety post-training. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
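
To make "removing specific directions in the activation space" concrete, here is a minimal PyTorch sketch of directional ablation via orthogonal projection on a decoder block's output. The layer index, the Hugging Face-style module path, and the precomputed direction file are assumptions for illustration, not the paper's exact procedure.

```python
# Ablate a fixed direction from a layer's residual-stream activations by
# projecting it out of the layer output at every position.
import torch

def project_out(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction` (orthogonal projection)."""
    d = direction / direction.norm()
    # hidden: (batch, seq_len, d_model); subtract (h . d) d from each position
    return hidden - (hidden @ d).unsqueeze(-1) * d

class DirectionAblationHook:
    """Forward hook that strips a fixed direction from a module's output."""
    def __init__(self, direction: torch.Tensor):
        self.direction = direction

    def __call__(self, module, inputs, output):
        if isinstance(output, tuple):  # decoder blocks often return tuples
            return (project_out(output[0], self.direction), *output[1:])
        return project_out(output, self.direction)

# Usage (assumes a Hugging Face-style decoder and a precomputed direction):
# direction = torch.load("unlearning_direction.pt")   # hypothetical artifact
# handle = model.model.layers[7].register_forward_hook(DirectionAblationHook(direction))
# ... run generation, then handle.remove()
```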


Evaluating Copyright Takedown Methods for Language Models

arXiv.org Artificial Intelligence

Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material. These models can memorize and generate content similar to their training data, posing potential concerns. Therefore, model creators are motivated to develop mitigation methods that prevent generating protected content. We term this procedure copyright takedowns for LMs, noting its conceptual similarity to (but legal distinction from) DMCA takedowns. This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs. We propose CoTaEval, an evaluation framework to assess the effectiveness of copyright takedown methods, their impact on the model's ability to retain uncopyrightable factual knowledge from the training data whose recitation is embargoed, and how well the model maintains its general utility and efficiency. We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches. Our findings indicate that no tested method excels across all metrics, showing significant room for research in this unique problem setting and indicating potential unresolved challenges for live policy proposals.
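
As one concrete example of what a decoding-time filtering intervention can look like, the sketch below masks any next token that would complete an n-gram from an embargoed corpus. It is an illustrative baseline with toy token ids, not the specific interventions evaluated by CoTaEval.

```python
# Decoding-time n-gram blocklist filter: at each generation step, forbid tokens
# that would reproduce an n-gram from the embargoed (e.g., copyrighted) corpus.
import torch

def banned_next_tokens(prefix_ids, blocked_ngrams, n=5):
    """Token ids that would complete a blocked n-gram given the current prefix."""
    if len(prefix_ids) < n - 1:
        return set()
    context = tuple(prefix_ids[-(n - 1):])
    return {ng[-1] for ng in blocked_ngrams if ng[:-1] == context}

def apply_ngram_filter(logits, prefix_ids, blocked_ngrams, n=5):
    """Mask out logits of tokens that would recite an embargoed n-gram."""
    for tok in banned_next_tokens(prefix_ids, blocked_ngrams, n):
        logits[tok] = float("-inf")
    return logits

# Toy example (a real setup would build `blocked_ngrams` by sliding an n-token
# window over the tokenized embargoed documents):
blocked = {(11, 42, 7, 99, 3)}
logits = torch.zeros(100)
filtered = apply_ngram_filter(logits, prefix_ids=[5, 11, 42, 7, 99],
                              blocked_ngrams=blocked, n=5)
assert filtered[3] == float("-inf")
```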


AI Risk Management Should Incorporate Both Safety and Security

arXiv.org Artificial Intelligence

The exposure of security vulnerabilities in safety-aligned language models, e.g., susceptibility to adversarial attacks, has shed light on the intricate interplay between AI safety and AI security. Although the two disciplines now come together under the overarching goal of AI risk management, they have historically evolved separately, giving rise to differing perspectives. Therefore, in this paper, we advocate that stakeholders in AI risk management should be aware of the nuances, synergies, and interplay between safety and security, and unambiguously take into account the perspectives of both disciplines in order to devise effective and holistic risk mitigation approaches. Unfortunately, this vision is often obfuscated, as the definitions of the basic concepts of "safety" and "security" themselves are often inconsistent and lack consensus across communities. With AI risk management being increasingly cross-disciplinary, this issue is particularly salient. In light of this conceptual challenge, we introduce a unified reference framework to clarify the differences and interplay between AI safety and AI security, aiming to facilitate a shared understanding and effective collaboration across communities.


Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

arXiv.org Artificial Intelligence

Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about 3% at the parameter level and 2.5% at the rank level.

Despite these efforts, recent studies have uncovered concerning 'jailbreak' scenarios. In these cases, even well-aligned models have had their safeguards successfully breached (Albert, 2023). These jailbreaks can include crafting adversarial prompts (Wei et al., 2023; Jones et al., 2023; Carlini et al., 2023; Zou et al., 2023b; Shen et al., 2023; Zhu et al., 2023; Qi et al., 2023), applying persuasion techniques (Zeng et al., 2024), or manipulating the model's decoding process (Huang et al., 2024). Recent studies show that finetuning an aligned LLM, even on a non-malicious dataset, can inadvertently weaken a model's safety mechanisms (Qi et al., 2024; Yang et al., 2023; Zhan et al., 2023). Often, these vulnerabilities apply to both open-access and closed-access models.
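
The parameter-level attribution idea can be illustrated with a short sketch: score each weight's importance on a safety dataset and on a utility dataset, then isolate weights that rank highly for safety but not for utility. The |weight × gradient| importance proxy and the top-k set difference below are simplifying assumptions for illustration, not necessarily the paper's exact attribution method; `model`, `loss_fn`, and the batches are placeholders.

```python
# Isolate weights that appear safety-critical but not utility-critical, then
# prune them to probe how brittle the safety guardrails are.
import torch

def importance_scores(model, loss_fn, batch):
    """First-order importance proxy |w * dL/dw| for every parameter on one batch."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    return {name: (p * p.grad).abs().detach()
            for name, p in model.named_parameters() if p.grad is not None}

def safety_critical_mask(safety_imp, utility_imp, top_frac=0.03):
    """Weights in the top `top_frac` for safety importance but not for utility."""
    masks = {}
    for name in safety_imp:
        s, u = safety_imp[name].flatten(), utility_imp[name].flatten()
        k = max(1, int(top_frac * s.numel()))
        top_s = torch.zeros_like(s, dtype=torch.bool)
        top_u = torch.zeros_like(u, dtype=torch.bool)
        top_s[s.topk(k).indices] = True
        top_u[u.topk(k).indices] = True
        masks[name] = (top_s & ~top_u).reshape(safety_imp[name].shape)
    return masks

# Zeroing the selected ~3% of weights and re-running a safety benchmark would
# then show whether removing this sparse region disables the guardrails:
# for name, p in model.named_parameters():
#     if name in masks:
#         p.data[masks[name]] = 0.0
```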