An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l

James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak

arXiv.org Artificial Intelligence 

In recent years, large language models (LLMs) have made impressive gains in capability (Vaswani et al. 2017; Devlin et al. 2019; OpenAI 2023; Radford et al. 2019; Brown et al. 2020), often surpassing expectations (Wei et al. 2022). However, these models remain poorly understood, with their successes and failures largely unexplained. Understanding what LLMs learn and how they generate predictions is therefore an increasingly urgent scientific and practical challenge. Mechanistic interpretability (MI) aims to reverse engineer models into human-understandable algorithms or circuits (Geiger et al. 2021; Olah 2022; Wang et al. 2022), while attempting to avoid pitfalls such as illusory understanding. With MI, we can identify and fix model errors (Vig et al. 2020; Hernandez et al. 2022; Meng et al. 2023; Hase et al. 2023), steer model outputs (Li et al. 2023), and explain emergent behaviors (Nanda et al. 2023; Barak et al. 2023). The central goals of MI are (a) localization: identifying the specific model components (attention heads, MLP layers) that make up a circuit; and (b) explanation: accounting for the behavior of these components.
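
Direct logit attribution (DLA), the technique named in the title, is one common localization tool: because the final logits are (up to the final LayerNorm) a linear function of the residual stream, each component's write into the residual stream can be projected through the unembedding to estimate its direct contribution to a given token's logit. The sketch below illustrates this decomposition on random toy tensors rather than the actual gelu-4l model; the dimensions, variable names, and the frozen-scale LayerNorm treatment are illustrative assumptions, not the paper's code.

```python
import torch

torch.manual_seed(0)

# Toy dimensions (illustrative only, not gelu-4l's real sizes).
d_model, d_vocab, n_components = 16, 50, 5

# Pretend each row is one component's write into the residual stream
# at the final token position (embedding, attention heads, MLPs, ...).
component_outputs = torch.randn(n_components, d_model)

# The residual stream is the sum of all component outputs.
resid_final = component_outputs.sum(dim=0)

# Unembedding matrix mapping the residual stream to logits.
W_U = torch.randn(d_model, d_vocab)

def apply_final_ln(x: torch.Tensor) -> torch.Tensor:
    # Final LayerNorm with its scale frozen at the full residual's value,
    # a standard simplifying assumption that keeps the map linear.
    scale = resid_final.std() + 1e-5
    return (x - x.mean(dim=-1, keepdim=True)) / scale

logits = apply_final_ln(resid_final) @ W_U
target_token = 7

# Direct logit attribution: project each component's (LN-scaled) output
# onto the unembedding direction of the target token.
per_component_dla = apply_final_ln(component_outputs) @ W_U[:, target_token]

print("target logit:", logits[target_token].item())
print("sum of per-component attributions:", per_component_dla.sum().item())
print("per-component DLA:", per_component_dla.tolist())
```

With the LayerNorm scale frozen, the per-component attributions sum exactly to the target logit, which is what makes DLA attractive for localization. Note, however, that this decomposition captures only each component's direct path to the logits, not its indirect effects through later components.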
