Che, Zora
PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models
Panaitescu-Liess, Michael-Andrei, Pathmanathan, Pankayaraj, Kaya, Yigitcan, Che, Zora, An, Bang, Zhu, Sicheng, Agrawal, Aakriti, Huang, Furong
As the capabilities of large language models (LLMs) continue to expand, their usage has become increasingly prevalent. However, as reflected in numerous ongoing lawsuits regarding LLM-generated content, addressing copyright infringement remains a significant challenge. In this paper, we introduce PoisonedParrot: the first stealthy data poisoning attack that induces an LLM to generate copyrighted content even when the model has not been directly trained on the specific copyrighted material. PoisonedParrot integrates small fragments of copyrighted text into the poison samples using an off-the-shelf LLM. Despite its simplicity, PoisonedParrot proves surprisingly effective across a wide range of experiments at priming the model to generate copyrighted content, with no discernible side effects. Moreover, we discover that existing defenses are largely ineffective against our attack. Finally, we make the first attempt at mitigating copyright-infringement poisoning attacks by proposing a defense: ParrotTrap. We encourage the community to explore this emerging threat model further.
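As a rough illustration of the fragment-based poison construction the abstract describes, the sketch below samples short verbatim fragments from a target passage and asks an off-the-shelf LLM, passed in as a plain `generate` callable, to weave each fragment into otherwise unrelated, fluent text. The function names, fragment length, and prompt wording are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of fragment-based poison construction; parameters are illustrative.
import random


def extract_fragments(copyrighted_text: str, frag_len: int = 6, k: int = 10) -> list[str]:
    """Sample up to k contiguous word n-grams of length frag_len from the target text."""
    words = copyrighted_text.split()
    if len(words) <= frag_len:
        return [copyrighted_text]
    starts = random.sample(range(len(words) - frag_len), min(k, len(words) - frag_len))
    return [" ".join(words[s:s + frag_len]) for s in starts]


def build_poison_samples(copyrighted_text: str, generate, k: int = 10) -> list[str]:
    """`generate` is any prompt-to-text callable (an off-the-shelf LLM).

    Each poison sample carries one fragment verbatim, so no single training
    example reproduces the protected passage outright.
    """
    poisons = []
    for frag in extract_fragments(copyrighted_text, k=k):
        prompt = (
            "Write one natural-sounding paragraph about an everyday topic that "
            f'contains the exact phrase "{frag}" verbatim.'
        )
        poisons.append(generate(prompt))
    return poisons
```

Because each sample embeds only a small fragment, the poisons look innocuous individually, which is consistent with the stealthiness the abstract claims.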
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Che, Zora, Casper, Stephen, Kirk, Robert, Satheesh, Anirudh, Slocum, Stewart, McKinney, Lev E, Gandikota, Rohit, Ewart, Aidan, Rosati, Domenic, Wu, Zichu, Cai, Zikui, Chughtai, Bilal, Gal, Yarin, Huang, Furong, Hadfield-Menell, Dylan
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model's worst-possible-case behavior. As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks, which allow modifications to latent activations or weights. We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. Together, these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations than input-space attacks alone. We release models at https://huggingface.co/LLM-GAT
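The 16-step result suggests how lightweight a tampering attack can be. Below is a minimal sketch of that style of attack using PyTorch and Hugging Face Transformers; the checkpoint path, attack data, and hyperparameters are placeholders, not the paper's evaluation suite.

```python
# Minimal sketch of a few-step fine-tuning tampering attack on an "unlearned" model.
# The checkpoint path and attack texts are placeholders for illustration.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/unlearned-model"                                 # placeholder checkpoint
attack_texts = ["text exercising the supposedly removed capability"]   # placeholder data

tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token            # many causal LMs ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()
opt = AdamW(model.parameters(), lr=2e-5)

for step in range(16):                       # roughly the fine-tuning budget cited above
    batch = tok(attack_texts, return_tensors="pt", padding=True, truncation=True)
    loss = model(**batch, labels=batch["input_ids"]).loss   # standard causal-LM loss
    loss.backward()
    opt.step()
    opt.zero_grad()
# Afterwards, re-run the capability benchmark to check whether the behavior returned.
```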
EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM?
Agrawal, Aakriti, Ding, Mucong, Che, Zora, Deng, Chenghao, Satheesh, Anirudh, Langford, John, Huang, Furong
How can we harness the collective capabilities of multiple Large Language Models (LLMs) to create an even more powerful model? This question forms the foundation of our research, where we propose an innovative approach to weak-to-strong (w2s) generalization, a critical problem in AI alignment. Our work introduces an easy-to-hard (e2h) framework for studying the feasibility of w2s generalization, where weak models trained on simpler tasks collaboratively supervise stronger models on more complex tasks. This setup mirrors real-world challenges, where direct human supervision is limited. To achieve this, we develop a novel AdaBoost-inspired ensemble method, demonstrating that an ensemble of weak supervisors can enhance the performance of stronger LLMs across classification and generative tasks on difficult QA datasets. In several cases, our ensemble approach matches the performance of models trained on ground-truth data, establishing a new benchmark for w2s generalization. We observe an improvement of up to 14% over existing baselines and average improvements of 5% and 4% for binary classification and generative tasks, respectively. This research points to a promising direction for enhancing AI through collective supervision, especially in scenarios where labeled data is sparse or insufficient.
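For intuition about the AdaBoost-inspired combination, the sketch below, an illustrative reconstruction rather than the paper's EnsemW2S procedure, weights each weak supervisor by the classic AdaBoost quantity alpha = 0.5 * ln((1 - err) / err) estimated on an easy split with ground truth, then forms weighted-majority pseudo-labels on the hard split for supervising the stronger model.

```python
# Illustrative AdaBoost-style weighting of weak supervisors (binary classification case).
import numpy as np


def supervisor_weights(weak_preds_easy: np.ndarray, easy_labels: np.ndarray) -> np.ndarray:
    """weak_preds_easy: (n_models, n_easy) binary predictions on the labeled easy split."""
    err = np.clip((weak_preds_easy != easy_labels).mean(axis=1), 1e-6, 1 - 1e-6)
    return 0.5 * np.log((1 - err) / err)                 # classic AdaBoost model weight


def ensemble_pseudo_labels(weak_preds_hard: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Weighted majority vote over {0, 1} predictions on the unlabeled hard split."""
    votes = np.where(weak_preds_hard == 1, 1.0, -1.0)    # map votes to {-1, +1}
    return (alphas @ votes > 0).astype(int)              # pseudo-labels for the strong model
```

The pseudo-labels then stand in for ground truth when fine-tuning the stronger model on the harder tasks.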
Auction-Based Regulation for Artificial Intelligence
Bornstein, Marco, Che, Zora, Julapalli, Suhas, Mohamed, Abdirisak, Bedi, Amrit Singh, Huang, Furong
In an era of "moving fast and breaking things", regulators have moved slowly to pick up the safety, bias, and legal pieces left in the wake of broken Artificial Intelligence (AI) deployment. Since AI models, such as large language models, are able to push misinformation and stoke division within our society, it is imperative for regulators to employ a framework that mitigates these dangers and ensures user safety. While there is much-warranted discussion about how to address the safety, bias, and legal woes of state-of-the-art AI models, rigorous and realistic mathematical frameworks for regulating AI safety remain scarce. We take on this challenge, proposing an auction-based regulatory mechanism that provably incentivizes model-building agents (i) to deploy safer models and (ii) to participate in the regulation process. We provably guarantee, via derived Nash Equilibria, that each participating agent's best strategy is to submit a model safer than a prescribed minimum-safety threshold. Empirical results show that our regulatory auction boosts safety and participation rates by 20% and 15%, respectively, outperforming simple regulatory frameworks that merely enforce minimum safety standards.
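As a toy illustration of the incentive structure described, not the paper's actual auction or payoff functions, the snippet below pays nothing to submissions below a minimum-safety threshold and rewards additional safety above it, so a participating agent can come out ahead by submitting a model safer than the threshold.

```python
# Toy payoff model for a threshold-based regulatory auction; all numbers are illustrative.
from dataclasses import dataclass


@dataclass
class Submission:
    agent: str
    safety: float   # e.g., a score in [0, 1] from a safety benchmark
    cost: float     # the agent's cost of reaching that safety level


def round_payoffs(subs: list[Submission], threshold: float, reward_scale: float) -> dict[str, float]:
    """Only above-threshold submissions earn a reward, and it grows with extra safety."""
    payoffs = {}
    for s in subs:
        reward = reward_scale * (s.safety - threshold) if s.safety >= threshold else 0.0
        payoffs[s.agent] = reward - s.cost
    return payoffs


print(round_payoffs(
    [Submission("A", safety=0.82, cost=0.10), Submission("B", safety=0.55, cost=0.02)],
    threshold=0.60, reward_scale=1.0,
))
```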
SAIL: Self-Improving Efficient Online Alignment of Large Language Models
Ding, Mucong, Chakraborty, Souradip, Agrawal, Vibhu, Che, Zora, Koppel, Alec, Wang, Mengdi, Bedi, Amrit, Huang, Furong
As artificial intelligence (AI) systems surpass human capabilities in various tasks, ensuring alignment with human values and ethics is crucial. This is especially important for large language models (LLMs), which are trained on diverse datasets that may contain harmful content. Reinforcement Learning from Human Feedback (RLHF) is an effective method for AI alignment, with models like OpenAI's GPT-4, Google's Gemini, and Anthropic's Claude showing safe and aligned behaviors. However, the vast majority of current research in RLHF (Agarwal et al., 2020; Rafailov et al., 2023; Ouyang et al., 2022; Chakraborty et al., 2024; Swamy et al., 2024) focuses on the offline setting, which involves using a fixed dataset of responses generated by the supervised fine-tuned (SFT) model, ranked by human experts. Consequently, these methods are inherently offline and heavily reliant on the quality of the offline data generated by the SFT model, which exhibits drawbacks such as insufficient coverage of response-query pairs, leading to sub-optimal alignment. To address these shortcomings, recent literature (Guo et al., 2024a; Sharma et al., 2024; Lee et al., 2023; Yuan et al., 2024b) has focused on designing online RLHF algorithms. The setting of online RLHF transcends the constraints of a static offline dataset and aims to address two critical questions: Q1: How should we generate new responses during fine-tuning?
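On Q1, the defining online ingredient is sampling fresh candidate responses from the current policy during fine-tuning rather than reading them from a fixed SFT-generated dataset. The sketch below shows that sampling step with Hugging Face Transformers; the checkpoint path and sampling settings are placeholders, and this is not SAIL's specific procedure.

```python
# Minimal sketch of online response generation from the current policy; names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

policy_name = "path/to/current-policy"     # placeholder: the model being aligned
tok = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name)


def sample_responses(prompt: str, n: int = 2, max_new_tokens: int = 128) -> list[str]:
    """Draw n candidate responses from the current policy for online preference data."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        outs = policy.generate(
            **inputs,
            do_sample=True,                # stochastic decoding yields diverse candidates
            num_return_sequences=n,
            max_new_tokens=max_new_tokens,
            pad_token_id=tok.eos_token_id,
        )
    prompt_len = inputs["input_ids"].shape[1]
    return [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outs]
```

The sampled candidates would then be ranked, by humans, a reward model, or the policy itself, and fed back into the preference-optimization update.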