Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning Dan Braun Jordan Taylor Nicholas Goldowsky-Dill Lee Sharkey

Nov-20-2025, 03:28:33 GMT–Neural Information Processing Systems

Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary that reconstructs a network's internal activations, have been used to identify these features. However, SAEs may learn more about the structure of the dataset than the computational structure of the network. There is therefore only indirect reason to believe that the directions found in these dictionaries are functionally important to the network. We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted.

activation, artificial intelligence, machine learning, (18 more...)

Neural Information Processing Systems

Nov-20-2025, 03:28:33 GMT

Conferences PDF

Add feedback

Country:
- Oceania > Australia
  - Queensland (0.04)
- Asia > Middle East
  - Jordan (0.40)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Information Technology (0.46)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Duplicate Docs Excel Report

Title
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning Dan Braun Jordan Taylor Nicholas Goldowsky-Dill Lee Sharkey

Similar Docs Excel Report more

Title	Similarity	Source
None found