Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Laptev, Daniil, Balagansky, Nikita, Aksenov, Yaroslav, Gavrilov, Daniil

Feb-6-2025–arXiv.org Artificial Intelligence

We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
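The abstract names two mechanisms: matching features across layers by cosine similarity without running any data, and steering by amplifying or suppressing a chosen feature. The sketch below illustrates both under stated assumptions and is not the authors' implementation. It assumes each feature is represented by a row of a sparse autoencoder's decoder weight matrix, that a feature counts as newly appearing when its best match falls below a threshold, and that steering adds a scaled unit feature direction to a hidden state. The function names, the 0.5 threshold, and the toy matrices are all hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(dec_a, dec_b, eps=1e-12):
    """Pairwise cosine similarity between rows of dec_a and rows of dec_b."""
    a = dec_a / (np.linalg.norm(dec_a, axis=1, keepdims=True) + eps)
    b = dec_b / (np.linalg.norm(dec_b, axis=1, keepdims=True) + eps)
    return a @ b.T  # shape: (n_features_a, n_features_b)


def match_features(dec_a, dec_b, threshold=0.5):
    """For each feature in the later layer (dec_b), find its closest
    feature in the earlier layer (dec_a).

    Returns a list of (index_b, index_a_or_None, similarity). A None
    predecessor means no earlier feature reached the (assumed) threshold,
    i.e. the feature is treated as first appearing at this layer.
    """
    sims = cosine_similarity_matrix(dec_a, dec_b)
    best_a = sims.argmax(axis=0)
    best_sim = sims.max(axis=0)
    matches = []
    for j in range(sims.shape[1]):
        predecessor = int(best_a[j]) if best_sim[j] >= threshold else None
        matches.append((j, predecessor, float(best_sim[j])))
    return matches


def steer(hidden, direction, alpha):
    """Shift a hidden state along a unit-normalized feature direction.

    Positive alpha amplifies the feature, negative alpha suppresses it.
    """
    unit = direction / (np.linalg.norm(direction) + 1e-12)
    return hidden + alpha * unit


if __name__ == "__main__":
    # Toy decoder matrices: 4 features at layer A, 3 at layer B, width 8.
    dec_a = np.eye(4, 8)
    dec_b = np.zeros((3, 8))
    dec_b[0, 2] = 2.0                    # rescaled copy of A's feature 2
    dec_b[1, 0], dec_b[1, 1] = 1.0, 0.1  # slightly rotated A's feature 0
    dec_b[2, 7] = 1.0                    # orthogonal to every A feature

    for idx_b, idx_a, sim in match_features(dec_a, dec_b):
        print(f"layer-B feature {idx_b}: predecessor={idx_a}, cos={sim:.3f}")

    steered = steer(np.zeros(8), dec_b[0], alpha=3.0)
    print(steered)
```

Because the matching uses only the weight matrices, no activations or input text are needed, which is one reading of "data-free" in the abstract. Chaining `match_features` over successive layer pairs would give the edges of a flow graph.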

large language model, machine learning, natural language

Country:
- North America > United States
  - Georgia > Fulton County
    - Atlanta (0.04)
  - Florida > Miami-Dade County
    - Miami (0.04)
- Europe > Russia
  - Central Federal District > Moscow Oblast > Moscow (0.04)

Genre:
- Research Report > New Finding (0.87)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks (0.88)
