Evolution of SAE Features Across Layers in LLMs

Balcells, Daniel, Lerner, Benjamin, Oesterle, Michael, Ucar, Ediz, Heimersheim, Stefan

Nov-17-2024–arXiv.org Artificial Intelligence

Sparse Autoencoders (SAEs) for transformer-based language models are typically defined independently per layer. In this work we analyze statistical relationships between features in adjacent layers to understand how features evolve through a forward pass. We provide a graph visualization interface for features and their most similar next-layer neighbors, and build communities of related features across layers. We find that a considerable number of features are passed through from a previous layer, some features can be expressed as quasi-boolean combinations of previous features, and some features become more specialized in later layers.
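As a rough illustration of what "most similar next-layer neighbors" could mean in practice, the sketch below ranks next-layer features by the Pearson correlation of their activations over a shared set of tokens. The abstract does not state which similarity measure the authors use, so correlation is an assumption here, as are the function name `top_next_layer_neighbors` and the synthetic activations, which only mimic a passed-through feature and an AND-like combination.

```python
import numpy as np


def top_next_layer_neighbors(acts_a, acts_b, k=3):
    """Hypothetical helper: for each layer-A feature, find the k most
    correlated layer-B features.

    acts_a: (n_tokens, n_feat_a) SAE activations at one layer
    acts_b: (n_tokens, n_feat_b) SAE activations at the next layer
    Returns (idx, scores), both of shape (n_feat_a, k).
    """
    eps = 1e-8  # guards against dead features with zero variance
    za = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + eps)
    zb = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + eps)
    # Pearson correlation between every A feature and every B feature.
    corr = za.T @ zb / acts_a.shape[0]
    idx = np.argsort(-corr, axis=1)[:, :k]
    scores = np.take_along_axis(corr, idx, axis=1)
    return idx, scores


# Synthetic, sparse, non-negative activations (illustrative only).
rng = np.random.default_rng(0)
n_tokens = 2000
acts_a = rng.random((n_tokens, 4)) * (rng.random((n_tokens, 4)) < 0.3)

acts_b = np.zeros((n_tokens, 3))
acts_b[:, 0] = acts_a[:, 2]                            # passed-through feature
acts_b[:, 1] = np.minimum(acts_a[:, 0], acts_a[:, 1])  # AND-like combination
acts_b[:, 2] = rng.random(n_tokens) * (rng.random(n_tokens) < 0.3)  # unrelated

idx, scores = top_next_layer_neighbors(acts_a, acts_b, k=1)
for feat, (j, s) in enumerate(zip(idx[:, 0], scores[:, 0])):
    print(f"layer-A feature {feat} -> layer-B feature {j} (corr={s:.2f})")
```

In this toy setup the passed-through feature shows up as a near-perfect correlation, while the AND-like feature correlates only partially with each of its inputs. That gap is one reason a single pairwise score may not be enough to detect quasi-boolean combinations.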

large language model, machine learning, similarity measure

Country:
- North America > Canada
  - Quebec > Montreal (0.04)
  - British Columbia (0.04)
- Europe
  - Estonia (0.04)
  - Netherlands > South Holland
    - Leiden (0.05)
- Asia > Middle East
  - Yemen (0.04)

Genre:
- Research Report (0.82)

Industry:
- Law (0.68)
- Leisure & Entertainment > Sports (0.67)
- Government (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.49)