Marks, Samuel
Auditing language models for hidden objectives
Marks, Samuel, Treutlein, Johannes, Bricken, Trenton, Lindsey, Jack, Marcus, Jonathan, Mishra-Sharma, Siddharth, Ziegler, Daniel, Ameisen, Emmanuel, Batson, Joshua, Belonax, Tim, Bowman, Samuel R., Carter, Shan, Chen, Brian, Cunningham, Hoagy, Denison, Carson, Dietz, Florian, Golechha, Satvik, Khan, Akbir, Kirchner, Jan, Leike, Jan, Meek, Austin, Nishimura-Gasparian, Kei, Ong, Euan, Olah, Christopher, Pearce, Adam, Roger, Fabien, Salle, Jeanne, Shih, Andy, Tong, Meg, Thomas, Drake, Rivoire, Kelley, Jermyn, Adam, MacDiarmid, Monte, Henighan, Tom, Hubinger, Evan
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
Karvonen, Adam, Rager, Can, Marks, Samuel, Nanda, Neel
Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units. However, a major bottleneck for SAE development has been the lack of high-quality performance metrics, with prior work largely relying on unsupervised proxies. In this work, we introduce a family of evaluations based on SHIFT, a downstream task from Marks et al. (Sparse Feature Circuits, 2024) in which spurious cues are removed from a classifier by ablating SAE features judged to be task-irrelevant by a human annotator. We adapt SHIFT into an automated metric of SAE quality; this involves replacing the human annotator with an LLM. Additionally, we introduce the Targeted Probe Perturbation (TPP) metric that quantifies an SAE's ability to disentangle similar concepts, effectively scaling SHIFT to a wider range of datasets. We apply both SHIFT and TPP to multiple open-source models, demonstrating that these metrics effectively differentiate between various SAE training hyperparameters and architectures.
Erasing Conceptual Knowledge from Language Models
Gandikota, Rohit, Feucht, Sheridan, Marks, Samuel, Bau, David
Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Denison, Carson, MacDiarmid, Monte, Barez, Fazl, Duvenaud, David, Kravec, Shauna, Marks, Samuel, Schiefer, Nicholas, Soklaski, Ryan, Tamkin, Alex, Kaplan, Jared, Shlegeris, Buck, Bowman, Samuel R., Perez, Ethan, Hubinger, Evan
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
Treutlein, Johannes, Choi, Dami, Betley, Jan, Anil, Cem, Marks, Samuel, Grosse, Roger Baker, Evans, Owain
One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs $(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Li, Nathaniel, Pan, Alexander, Gopal, Anjali, Yue, Summer, Berrios, Daniel, Gatti, Alice, Li, Justin D., Dombrowski, Ann-Kathrin, Goel, Shashwat, Phan, Long, Mukobi, Gabriel, Helm-Burger, Nathan, Lababidi, Rassin, Justen, Lennart, Liu, Andrew B., Chen, Michael, Barrass, Isabelle, Zhang, Oliver, Zhu, Xiaoyuan, Tamirisa, Rishub, Bharathi, Bhrugu, Khoja, Adam, Zhao, Zhenqi, Herbert-Voss, Ariel, Breuer, Cort B., Marks, Samuel, Patel, Oam, Zou, Andy, Mazeika, Mantas, Wang, Zifan, Oswal, Palash, Lin, Weiran, Hunt, Adam A., Tienken-Harder, Justin, Shih, Kevin Y., Talley, Kemper, Guan, John, Kaplan, Russell, Steneker, Ian, Campbell, David, Jokubaitis, Brad, Levinson, Alex, Wang, Jean, Qian, William, Karmakar, Kallol Krishna, Basart, Steven, Fitz, Stephen, Levine, Mindy, Kumaraguru, Ponnurangam, Tupakula, Uday, Varadharajan, Vijay, Wang, Ruoyu, Shoshitaishvili, Yan, Ba, Jimmy, Esvelt, Kevin M., Wang, Alexandr, Hendrycks, Dan
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Marks, Samuel, Rager, Can, Michaud, Eric J., Belinkov, Yonatan, Bau, David, Mueller, Aaron
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on fine-grained units, sparse feature circuits are useful for downstream tasks: We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Marks, Samuel, Tegmark, Max
Large Language Models (LLMs) have impressive capabilities, but are also prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we curate high-quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements. We also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques. Despite their impressive capabilities, large language models (LLMs) do not always output true text (Lin et al., 2022; Steinhardt, 2023; Park et al., 2023). In some cases, this is because they do not know better. In other cases, LLMs apparently know that statements are false but generate them anyway. For instance, Perez et al. (2022) demonstrate that LLM assistants output more falsehoods when prompted with the biography of a less-educated user. More starkly, OpenAI (2023) documents a case where a GPT-4-based agent gained a person's help in solving a CAPTCHA by lying about being a vision-impaired human. "I should not reveal that I am a robot," the agent wrote in an internal chain-of-thought scratchpad, "I should make up an excuse for why I cannot solve CAPTCHAs." We would like techniques which, given a language model M and a statement s, determine whether M believes s to be true (Christiano et al., 2021).
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Casper, Stephen, Davies, Xander, Shi, Claudia, Gilbert, Thomas Krendl, Scheurer, Jérémy, Rando, Javier, Freedman, Rachel, Korbak, Tomasz, Lindner, David, Freire, Pedro, Wang, Tony, Marks, Samuel, Segerie, Charbel-Raphaël, Carroll, Micah, Peng, Andi, Christoffersen, Phillip, Damani, Mehul, Slocum, Stewart, Anwar, Usman, Siththaranjan, Anand, Nadeau, Max, Michaud, Eric J., Pfau, Jacob, Krasheninnikov, Dmitrii, Chen, Xin, Langosco, Lauro, Hase, Peter, Bıyık, Erdem, Dragan, Anca, Krueger, David, Sadigh, Dorsa, Hadfield-Menell, Dylan
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.