An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l

James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak

arXiv.org Artificial Intelligence 

In recent years, large language models (LLMs) have made impressive gains in capability (Vaswani et al. 2017; Devlin et al. 2019; OpenAI 2023; Radford et al. 2019; Brown et al. 2020), often surpassing expectations (Wei et al. 2022). However, these models remain poorly understood, with their successes and failures largely unexplained. Understanding what LLMs learn and how they generate predictions is therefore an increasingly urgent scientific and practical challenge. Mechanistic interpretability (MI) aims to reverse engineer models into human-understandable algorithms or circuits (Geiger et al. 2021; Olah 2022; Wang et al. 2022), while attempting to avoid pitfalls such as illusory understanding. With MI, we can identify and fix model errors (Vig et al. 2020; Hernandez et al. 2022; Meng et al. 2023; Hase et al. 2023), steer model outputs (Li et al. 2023), and explain emergent behaviors (Nanda et al. 2023; Barak et al. 2023). The central goals of MI are (a) localization: identifying the specific model components (attention heads, MLP layers) that make up a circuit; and (b) explanation: accounting for the behavior of these components.
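
Direct logit attribution (DLA), the technique named in the title, is one common localization tool: because the final logits are (up to the final LayerNorm) a linear function of the residual stream, each component's write into the residual stream can be projected through the unembedding to estimate its direct contribution to a given token's logit. The sketch below illustrates this decomposition on random toy tensors rather than the actual gelu-4l model; the dimensions, variable names, and the frozen-scale LayerNorm treatment are illustrative assumptions, not the paper's code.

```python
import torch

torch.manual_seed(0)

# Toy dimensions (illustrative only, not gelu-4l's real sizes).
d_model, d_vocab, n_components = 16, 50, 5

# Pretend each row is one component's write into the residual stream
# at the final token position (embedding, attention heads, MLPs, ...).
component_outputs = torch.randn(n_components, d_model)

# The residual stream is the sum of all component outputs.
resid_final = component_outputs.sum(dim=0)

# Unembedding matrix mapping the residual stream to logits.
W_U = torch.randn(d_model, d_vocab)

def apply_final_ln(x: torch.Tensor) -> torch.Tensor:
    # Final LayerNorm with its scale frozen at the full residual's value,
    # a standard simplifying assumption that keeps the map linear.
    scale = resid_final.std() + 1e-5
    return (x - x.mean(dim=-1, keepdim=True)) / scale

logits = apply_final_ln(resid_final) @ W_U
target_token = 7

# Direct logit attribution: project each component's (LN-scaled) output
# onto the unembedding direction of the target token.
per_component_dla = apply_final_ln(component_outputs) @ W_U[:, target_token]

print("target logit:", logits[target_token].item())
print("sum of per-component attributions:", per_component_dla.sum().item())
print("per-component DLA:", per_component_dla.tolist())
```

With the LayerNorm scale frozen, the per-component attributions sum exactly to the target logit, which is what makes DLA attractive for localization. Note, however, that this decomposition captures only each component's direct path to the logits, not its indirect effects through later components.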
