Dietz, Florian
Auditing language models for hidden objectives
Marks, Samuel, Treutlein, Johannes, Bricken, Trenton, Lindsey, Jack, Marcus, Jonathan, Mishra-Sharma, Siddharth, Ziegler, Daniel, Ameisen, Emmanuel, Batson, Joshua, Belonax, Tim, Bowman, Samuel R., Carter, Shan, Chen, Brian, Cunningham, Hoagy, Denison, Carson, Dietz, Florian, Golechha, Satvik, Khan, Akbir, Kirchner, Jan, Leike, Jan, Meek, Austin, Nishimura-Gasparian, Kei, Ong, Euan, Olah, Christopher, Pearce, Adam, Roger, Fabien, Salle, Jeanne, Shih, Andy, Tong, Meg, Thomas, Drake, Rivoire, Kelley, Jermyn, Adam, MacDiarmid, Monte, Henighan, Tom, Hubinger, Evan
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.
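To make the interpretability technique named above concrete: a sparse autoencoder (SAE) decomposes model activations into a wider dictionary of sparsely active features, which auditors can then inspect for features related to a hidden objective. The Python sketch below shows the standard SAE construction only; the sizes and the sparsity coefficient are illustrative assumptions, not values from the paper.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        # Encode activations into a wider, sparse feature basis, then
        # reconstruct them; sparse features tend to be more interpretable.
        def __init__(self, d_model: int, d_dict: int):
            super().__init__()
            self.enc = nn.Linear(d_model, d_dict)
            self.dec = nn.Linear(d_dict, d_model)

        def forward(self, x):
            feats = torch.relu(self.enc(x))     # sparse feature activations
            return self.dec(feats), feats

    sae = SparseAutoencoder(d_model=512, d_dict=4096)  # sizes illustrative
    acts = torch.randn(64, 512)            # stand-in for model activations
    recon, feats = sae(acts)
    # Reconstruction loss plus an L1 penalty that encourages sparsity.
    loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()
    print(loss.item())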
IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently
Dietz, Florian, Klakow, Dietrich
Arithmetic is a simple and fundamental skill, yet modern Large Language Models (LLMs) have great difficulty with it. We introduce the Integrated Gated Calculator (IGC), a module that enables LLMs to perform arithmetic by emulating a calculator on the GPU. We finetune a Llama model with our module and test it on the BigBench Arithmetic benchmark, where it sets a new state of the art, outperforming all models on the benchmark, including models almost two orders of magnitude larger. Our approach takes only a single iteration to run and requires no external tools. It performs arithmetic operations entirely inside the LLM, without producing intermediate tokens. It is computationally efficient, interpretable, and avoids side effects on tasks that do not require arithmetic. It reliably achieves 98% to 99% accuracy across multiple training runs and all subtasks, including the substantially harder subtask of multiplication, which was previously unsolved.
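For intuition, the PyTorch sketch below shows one way a gated calculator module could be wired into a transformer's residual stream: small heads decode operands and an operation from the hidden state, the result is computed exactly on the GPU, and a learned gate injects it back. All names and design details here are assumptions for exposition, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    class GatedCalculator(nn.Module):
        # Hypothetical sketch: decode (a, b) and an operation from the
        # hidden state, compute the result exactly, gate it back in.
        def __init__(self, d_model: int):
            super().__init__()
            self.operand_head = nn.Linear(d_model, 2)  # predicts (a, b)
            self.op_head = nn.Linear(d_model, 4)       # logits over +,-,*,/
            self.gate = nn.Linear(d_model, 1)          # how much to inject
            self.out_proj = nn.Linear(1, d_model)      # embed the result

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            a, b = self.operand_head(h).unbind(dim=-1)
            op = self.op_head(h).softmax(dim=-1)
            # Differentiable mixture over the four exact operations.
            results = torch.stack(
                [a + b, a - b, a * b, a / b.clamp(min=1e-6)], dim=-1)
            result = (op * results).sum(dim=-1, keepdim=True)
            g = torch.sigmoid(self.gate(h))            # gate in [0, 1]
            return h + g * self.out_proj(result)       # residual update

    h = torch.randn(2, 16, 512)                 # (batch, seq, d_model)
    print(GatedCalculator(512)(h).shape)        # torch.Size([2, 16, 512])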
Enhancing Reinforcement Learning with discrete interfaces to learn the Dyck Language
Dietz, Florian, Klakow, Dietrich
Even though most interfaces in the real world are discrete, no efficient method yet exists for training neural networks to use them. We enhance an Interaction Network (a Reinforcement Learning architecture) with discrete interfaces and train it on the generalized Dyck language. Solving this task requires an understanding of hierarchical structures, and it has long proven difficult for neural networks. We provide the first solution based on learning to use discrete data structures. We encountered unexpected anomalous behavior during training and overcame it with pre-training based on execution traces. The resulting model is very small and fast, and it generalizes to sequences an order of magnitude longer than the training data.
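The sketch below illustrates, in Python, the kind of discrete stack interface involved in recognizing Dyck words. The push/pop action space is the discrete interface a policy would learn to operate; the hand-coded oracle policy here merely stands in for the learned one, and all names are illustrative, not from the paper.

    # Hypothetical discrete stack interface for the Dyck task.
    PAIRS = {")": "(", "]": "[", "}": "{"}

    def dyck_env_step(stack, symbol, action):
        # Apply a discrete action ("push" or "pop") to the stack;
        # return False if the action is invalid for this symbol.
        if action == "push":
            stack.append(symbol)
            return True
        if action == "pop":
            return bool(stack) and stack.pop() == PAIRS[symbol]
        return False

    def recognizes(word: str) -> bool:
        stack = []
        for s in word:
            action = "push" if s in PAIRS.values() else "pop"  # oracle policy
            if not dyck_env_step(stack, s, action):
                return False
        return not stack                     # accept iff the stack is empty

    print(recognizes("([]{})"))   # True
    print(recognizes("([)]"))     # False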
Interaction Networks: Using a Reinforcement Learner to train other Machine Learning algorithms
Dietz, Florian
The wiring of neurons in the brain is more flexible than the wiring of connections in contemporary artificial neural networks, and it is possible that this extra flexibility is important for efficient problem solving and learning. This paper introduces the Interaction Network, which aims to capture some of this extra flexibility. An Interaction Network consists of a collection of conventional neural networks, a set of memory locations, and a DQN or other reinforcement learner. The DQN decides when each of the neural networks is executed and on which memory locations. In this way, the individual neural networks can be trained on different data and for different tasks. At the same time, the results of the individual networks influence the decision process of the reinforcement learner, creating a feedback loop that allows the DQN to perform actions that improve its own decision-making. Any existing type of neural network can be reproduced in an Interaction Network in its entirety, with only a constant computational overhead. Interaction Networks can then introduce additional features to improve performance further; these features make the algorithm more flexible and general, but at the expense of being harder to train. In this paper, thought experiments explore how the additional abilities of Interaction Networks could improve various existing types of neural networks. We ran several experiments to demonstrate that the concept is sound. They show that the basic idea works, but they also reveal a number of challenges that do not appear in conventional neural networks and that make Interaction Networks very hard to train. Further research is needed to alleviate these issues, and this paper outlines a number of promising avenues for doing so.
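A minimal Python sketch of the control loop described above: a controller assigns a score to every (sub-network, source slot, destination slot) triple, the highest-scoring action is executed, and the chosen sub-network rewrites a memory slot. In a real Interaction Network the controller would be a DQN trained on task reward; everything here, including the greedy action selection, is an illustrative assumption rather than the paper's implementation.

    import torch
    import torch.nn as nn

    D, N_SLOTS = 8, 4
    memory = torch.zeros(N_SLOTS, D)
    subnets = nn.ModuleList([nn.Linear(D, D) for _ in range(3)])
    # One Q-value per (sub-network, source slot, destination slot) triple.
    controller = nn.Linear(N_SLOTS * D, len(subnets) * N_SLOTS * N_SLOTS)

    with torch.no_grad():
        for step in range(5):
            q = controller(memory.flatten())
            net, rest = divmod(int(q.argmax()), N_SLOTS * N_SLOTS)
            src, dst = divmod(rest, N_SLOTS)
            # Execute the chosen sub-network on the chosen memory slots.
            memory[dst] = torch.tanh(subnets[net](memory[src]))
            print(f"step {step}: net {net} read slot {src}, wrote slot {dst}")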