Collaborating Authors

 Nadeau, Max


Circuit Breaking: Removing Model Behaviors with Targeted Ablation

arXiv.org Artificial Intelligence

Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find that ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
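To make the idea concrete, here is a minimal, self-contained sketch of learning a sparse edge-ablation mask by gradient descent. Everything in it is a stand-in (random tensors for per-edge contributions, a linear readout for the downstream component, a sigmoid score as a proxy for toxic generation); it illustrates the optimization pattern, not the authors' implementation.

```python
import torch

torch.manual_seed(0)
n_edges, d_model = 32, 16

# Stand-ins for cached per-edge contributions into one downstream component,
# on inputs that elicit the bad behavior and on ordinary inputs.
bad_contribs = torch.randn(64, n_edges, d_model)
clean_contribs = torch.randn(64, n_edges, d_model)
readout = torch.nn.Linear(d_model, 1)          # frozen downstream component
for p in readout.parameters():
    p.requires_grad_(False)

mask_logits = torch.nn.Parameter(torch.full((n_edges,), 3.0))  # start near "keep every edge"
opt = torch.optim.Adam([mask_logits], lr=0.05)
full_mask = torch.ones(n_edges)

def downstream(contribs, mask):
    # Each edge's contribution is scaled by its mask value; 0 means ablated.
    pooled = (contribs * mask.view(1, -1, 1)).sum(dim=1)
    return readout(pooled).squeeze(-1)

for step in range(300):
    opt.zero_grad()
    mask = torch.sigmoid(mask_logits)
    toxic_score = torch.sigmoid(downstream(bad_contribs, mask)).mean()   # proxy for bad behavior
    drift = (downstream(clean_contribs, mask)
             - downstream(clean_contribs, full_mask)).pow(2).mean()      # preserve clean behavior
    sparsity = (1.0 - mask).sum()                                        # ablate as few edges as possible
    (toxic_score + drift + 0.01 * sparsity).backward()
    opt.step()

ablated = (torch.sigmoid(mask_logits) < 0.5).nonzero().flatten().tolist()
print(f"ablated {len(ablated)} of {n_edges} edges: {ablated}")
```

After training, edges whose mask value falls below 0.5 are the ablation candidates; the sparsity penalty keeps that set small, mirroring in spirit the 12-of-11.6K result reported above.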


Benchmarks for Detecting Measurement Tampering

arXiv.org Artificial Intelligence

When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization. One concern is measurement tampering, where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. In this work, we build four new text-based datasets to evaluate measurement tampering detection techniques on large language models. Concretely, given sets of text inputs and measurements aimed at determining if some outcome occurred, as well as a base model able to accurately predict measurements, the goal is to determine if examples where all measurements indicate the outcome occurred actually had the outcome occur, or if this was caused by measurement tampering. We demonstrate techniques that outperform simple baselines on most datasets, but don't achieve maximum performance. We believe there is significant room for improvement for both techniques and datasets, and we are excited for future work tackling measurement tampering.
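A simple baseline along these lines can be sketched as a linear probe over frozen activations of the measurement-predicting base model, trained only on examples with trusted ground truth and then applied to the suspicious set where every measurement looks positive. The activations and labels below are random stand-ins; this illustrates the detection setup, not one of the paper's benchmark techniques.

```python
import torch

torch.manual_seed(0)
d_act = 128

# Stand-ins for activations extracted from the frozen measurement-predicting model.
trusted_acts = torch.randn(200, d_act)            # examples with known ground truth
trusted_labels = (torch.rand(200) > 0.5).float()  # 1 = the outcome actually occurred
suspicious_acts = torch.randn(50, d_act)          # examples where all measurements look positive

probe = torch.nn.Linear(d_act, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(trusted_acts).squeeze(-1), trusted_labels)
    loss.backward()
    opt.step()

with torch.no_grad():
    # Low predicted probability of a real outcome = likely measurement tampering.
    tamper_scores = 1.0 - torch.sigmoid(probe(suspicious_acts).squeeze(-1))
```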


Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

arXiv.org Artificial Intelligence

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.
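For context on the object being critiqued, the core RLHF recipe fits in a few lines: fit a reward model to pairwise human preferences with a Bradley-Terry loss, then optimize the policy against that learned reward, usually with a KL penalty toward the pretrained model. The sketch below uses stand-in features rather than a real LLM and is only meant to pin down the pipeline whose limitations the paper surveys.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64
reward_model = torch.nn.Linear(d, 1)   # stand-in for a reward head on top of an LLM
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in features for (chosen, rejected) completion pairs from human comparisons.
chosen, rejected = torch.randn(256, d), torch.randn(256, d)

for _ in range(100):
    opt.zero_grad()
    margin = reward_model(chosen) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()   # Bradley-Terry: prefer chosen over rejected
    loss.backward()
    opt.step()

# A policy is then optimized against this learned reward (e.g., with PPO),
# typically with a KL penalty toward the pretrained model; many of the
# surveyed failure modes arise because this reward is only a proxy for human goals.
```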


Discovering Variable Binding Circuitry with Desiderata

arXiv.org Artificial Intelligence

Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of desiderata, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.
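The underlying causal-mediation test is easy to state in code: run the model on a corrupted input, patch one component's activation from the clean run, and score the component by how much the clean answer is restored; a desideratum is then a predicate over such scores. The toy model below (random weights, a scalar readout) is a hypothetical illustration of that scoring loop, not the LLaMA-13B experiment.

```python
import torch

torch.manual_seed(0)
n_components, d = 16, 8
weights = [torch.randn(d, d) for _ in range(n_components)]
readout = torch.randn(d)

def run(x, patch=None):
    # Toy "model": components write sequentially into a residual stream.
    # `patch = (i, act)` overwrites component i's output with a cached activation.
    acts = []
    for i, w in enumerate(weights):
        out = torch.tanh(x @ w)
        if patch is not None and patch[0] == i:
            out = patch[1]
        acts.append(out)
        x = x + out
    return x @ readout, acts

clean_x, corrupt_x = torch.randn(d), torch.randn(d)
clean_logit, clean_acts = run(clean_x)
corrupt_logit, _ = run(corrupt_x)

# Score each component by how much patching its clean activation into the
# corrupted run restores the clean output (a normalized indirect effect).
scores = []
for i in range(n_components):
    patched_logit, _ = run(corrupt_x, patch=(i, clean_acts[i]))
    scores.append(float((patched_logit - corrupt_logit)
                        / (clean_logit - corrupt_logit + 1e-6)))
top_components = sorted(range(n_components), key=lambda i: -scores[i])[:3]
print("components satisfying the desideratum:", top_components)
```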


One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features

arXiv.org Artificial Intelligence

It is well understood that modern deep networks are vulnerable to adversarial attacks. However, conventional methods fail to produce adversarial perturbations that are intelligible to humans, and they pose limited threats in the physical world. To study feature-class associations in networks and better understand the real-world threats they face, we develop feature-level adversarial perturbations using deep image generators and a novel optimization objective. We show that they are versatile and use them to generate targeted feature-level attacks at the ImageNet scale that are simultaneously interpretable, universal to any source image, and physically realizable. These attacks can also reveal spurious, semantically describable feature/class associations, and we use them to guide the design of "copy/paste" adversaries in which one natural image is pasted into another to cause a targeted misclassification.

State-of-the-art neural networks are vulnerable to adversarial inputs, which cause the network to fail yet differ from benign inputs only in subtle ways. Adversaries for visual classifiers conventionally take the form of a small-norm perturbation to a benign source image that causes misclassification (Szegedy et al., 2013; Goodfellow et al., 2014). These are effective, but to a human the perturbations typically appear as random or mildly textured noise. As such, analyzing these adversaries does not reveal information about how the network will function, and how it may fail, when presented with human-interpretable features. Another limitation of conventional adversaries is that they do not tend to be physically realizable. While they can retain some effectiveness when printed and photographed in a controlled setting (Kurakin et al., 2016), they are generally ineffective in less controlled settings such as those experienced by autonomous vehicles (Kong et al., 2020). Several works discussed in Section 2 have aimed to produce adversarial modifications that are universal to any source image, interpretable, or physically realizable.
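A rough sketch of the optimization is below: a latent vector fed through an image generator produces a patch, the patch is pasted onto randomly drawn source images (which is what makes the attack universal), and the latent is updated to maximize the classifier's probability of a chosen target class. Both the generator and the classifier here are small stand-in modules rather than the pretrained ImageNet-scale models used in the paper, and the paste location and loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, img, patch_size = 10, 32, 8

generator = torch.nn.Sequential(           # stand-in for a deep image generator
    torch.nn.Linear(64, 3 * patch_size * patch_size), torch.nn.Tanh())
classifier = torch.nn.Sequential(          # stand-in for the attacked classifier
    torch.nn.Flatten(), torch.nn.Linear(3 * img * img, n_classes))

z = torch.nn.Parameter(torch.randn(64))    # latent controlling the adversarial feature
opt = torch.optim.Adam([z], lr=0.05)
target = torch.tensor([3])                 # chosen target class

for step in range(200):
    opt.zero_grad()
    patch = generator(z).view(3, patch_size, patch_size)
    sources = torch.rand(16, 3, img, img)  # fresh random source images -> universality
    x = sources.clone()
    x[:, :, :patch_size, :patch_size] = patch   # "paste" the generated feature
    loss = F.cross_entropy(classifier(x), target.expand(16))
    loss.backward()
    opt.step()
```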