Dietz, Florian
Auditing language models for hidden objectives
Marks, Samuel, Treutlein, Johannes, Bricken, Trenton, Lindsey, Jack, Marcus, Jonathan, Mishra-Sharma, Siddharth, Ziegler, Daniel, Ameisen, Emmanuel, Batson, Joshua, Belonax, Tim, Bowman, Samuel R., Carter, Shan, Chen, Brian, Cunningham, Hoagy, Denison, Carson, Dietz, Florian, Golechha, Satvik, Khan, Akbir, Kirchner, Jan, Leike, Jan, Meek, Austin, Nishimura-Gasparian, Kei, Ong, Euan, Olah, Christopher, Pearce, Adam, Roger, Fabien, Salle, Jeanne, Shih, Andy, Tong, Meg, Thomas, Drake, Rivoire, Kelley, Jermyn, Adam, MacDiarmid, Monte, Henighan, Tom, Hubinger, Evan
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.
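To make the interpretability technique named above concrete: a sparse autoencoder (SAE) decomposes model activations into a wider dictionary of sparsely active features, which auditors can then inspect for features related to a hidden objective. The Python sketch below shows the standard SAE construction only; the sizes and the sparsity coefficient are illustrative assumptions, not values from the paper.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        # Encode activations into a wider, sparse feature basis, then
        # reconstruct them; sparse features tend to be more interpretable.
        def __init__(self, d_model: int, d_dict: int):
            super().__init__()
            self.enc = nn.Linear(d_model, d_dict)
            self.dec = nn.Linear(d_dict, d_model)

        def forward(self, x):
            feats = torch.relu(self.enc(x))     # sparse feature activations
            return self.dec(feats), feats

    sae = SparseAutoencoder(d_model=512, d_dict=4096)  # sizes illustrative
    acts = torch.randn(64, 512)            # stand-in for model activations
    recon, feats = sae(acts)
    # Reconstruction loss plus an L1 penalty that encourages sparsity.
    loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()
    print(loss.item())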
IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently
Dietz, Florian, Klakow, Dietrich
Arithmetic is a simple and fundamental skill, yet modern Large Language Models (LLMs) have great difficulty with it. We introduce the Integrated Gated Calculator (IGC), a module that enables LLMs to perform arithmetic by emulating a calculator on the GPU. We finetune a Llama model with our module and test it on the BigBench Arithmetic benchmark, where it sets a new state of the art, outperforming all models on the benchmark, including models almost two orders of magnitude larger. Our approach takes only a single iteration to run and requires no external tools. It performs arithmetic operations entirely inside the LLM, without producing intermediate tokens. It is computationally efficient, interpretable, and avoids side effects on tasks that do not require arithmetic. It reliably achieves 98% to 99% accuracy across multiple training runs and all subtasks, including the substantially harder subtask of multiplication, which was previously unsolved.
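For intuition, the PyTorch sketch below shows one way a gated calculator module could be wired into a transformer's residual stream: small heads decode operands and an operation from the hidden state, the result is computed exactly on the GPU, and a learned gate injects it back. All names and design details here are assumptions for exposition, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    class GatedCalculator(nn.Module):
        # Hypothetical sketch: decode (a, b) and an operation from the
        # hidden state, compute the result exactly, gate it back in.
        def __init__(self, d_model: int):
            super().__init__()
            self.operand_head = nn.Linear(d_model, 2)  # predicts (a, b)
            self.op_head = nn.Linear(d_model, 4)       # logits over +,-,*,/
            self.gate = nn.Linear(d_model, 1)          # how much to inject
            self.out_proj = nn.Linear(1, d_model)      # embed the result

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            a, b = self.operand_head(h).unbind(dim=-1)
            op = self.op_head(h).softmax(dim=-1)
            # Differentiable mixture over the four exact operations.
            results = torch.stack(
                [a + b, a - b, a * b, a / b.clamp(min=1e-6)], dim=-1)
            result = (op * results).sum(dim=-1, keepdim=True)
            g = torch.sigmoid(self.gate(h))            # gate in [0, 1]
            return h + g * self.out_proj(result)       # residual update

    h = torch.randn(2, 16, 512)                 # (batch, seq, d_model)
    print(GatedCalculator(512)(h).shape)        # torch.Size([2, 16, 512])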
Enhancing Reinforcement Learning with discrete interfaces to learn the Dyck Language
Dietz, Florian, Klakow, Dietrich
Even though most interfaces in the real world are discrete, no efficient method yet exists for training neural networks to use them. We enhance an Interaction Network (a Reinforcement Learning architecture) with discrete interfaces and train it on the generalized Dyck language. Solving this task requires an understanding of hierarchical structures, and it has long proven difficult for neural networks. We provide the first solution based on learning to use discrete data structures. We encountered unexpected anomalous behavior during training and overcame it with pre-training based on execution traces. The resulting model is very small and fast, and it generalizes to sequences an order of magnitude longer than the training data.
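The sketch below illustrates, in Python, the kind of discrete stack interface involved in recognizing Dyck words. The push/pop action space is the discrete interface a policy would learn to operate; the hand-coded oracle policy here merely stands in for the learned one, and all names are illustrative, not from the paper.

    # Hypothetical discrete stack interface for the Dyck task.
    PAIRS = {")": "(", "]": "[", "}": "{"}

    def dyck_env_step(stack, symbol, action):
        # Apply a discrete action ("push" or "pop") to the stack;
        # return False if the action is invalid for this symbol.
        if action == "push":
            stack.append(symbol)
            return True
        if action == "pop":
            return bool(stack) and stack.pop() == PAIRS[symbol]
        return False

    def recognizes(word: str) -> bool:
        stack = []
        for s in word:
            action = "push" if s in PAIRS.values() else "pop"  # oracle policy
            if not dyck_env_step(stack, s, action):
                return False
        return not stack                     # accept iff the stack is empty

    print(recognizes("([]{})"))   # True
    print(recognizes("([)]"))     # False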
Interaction Networks: Using a Reinforcement Learner to train other Machine Learning algorithms
Dietz, Florian
The wiring of neurons in the brain is more flexible than the wiring of connections in contemporary artificial neural networks, and it is possible that this extra flexibility is important for efficient problem solving and learning. This paper introduces the Interaction Network, which aims to capture some of this extra flexibility. An Interaction Network consists of a collection of conventional neural networks, a set of memory locations, and a DQN or other reinforcement learner. The DQN decides when each of the neural networks is executed and on which memory locations. In this way, the individual neural networks can be trained on different data and for different tasks. At the same time, the results of the individual networks influence the decision process of the reinforcement learner, creating a feedback loop that allows the DQN to perform actions that improve its own decision-making. Any existing type of neural network can be reproduced in an Interaction Network in its entirety, with only a constant computational overhead. Interaction Networks can then introduce additional features to improve performance further; these features make the algorithm more flexible and general, but at the expense of being harder to train. In this paper, thought experiments explore how the additional abilities of Interaction Networks could improve various existing types of neural networks. We ran several experiments to demonstrate that the concept is sound. They show that the basic idea works, but they also reveal a number of challenges that do not appear in conventional neural networks and that make Interaction Networks very hard to train. Further research is needed to alleviate these issues, and this paper outlines a number of promising avenues for doing so.
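A minimal Python sketch of the control loop described above: a controller assigns a score to every (sub-network, source slot, destination slot) triple, the highest-scoring action is executed, and the chosen sub-network rewrites a memory slot. In a real Interaction Network the controller would be a DQN trained on task reward; everything here, including the greedy action selection, is an illustrative assumption rather than the paper's implementation.

    import torch
    import torch.nn as nn

    D, N_SLOTS = 8, 4
    memory = torch.zeros(N_SLOTS, D)
    subnets = nn.ModuleList([nn.Linear(D, D) for _ in range(3)])
    # One Q-value per (sub-network, source slot, destination slot) triple.
    controller = nn.Linear(N_SLOTS * D, len(subnets) * N_SLOTS * N_SLOTS)

    with torch.no_grad():
        for step in range(5):
            q = controller(memory.flatten())
            net, rest = divmod(int(q.argmax()), N_SLOTS * N_SLOTS)
            src, dst = divmod(rest, N_SLOTS)
            # Execute the chosen sub-network on the chosen memory slots.
            memory[dst] = torch.tanh(subnets[net](memory[src]))
            print(f"step {step}: net {net} read slot {src}, wrote slot {dst}")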