Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

Dec-8-2025–arXiv.org Artificial Intelligence

Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to interpretable concepts. A core SAE training hyperparameter is L0: how many SAE features should fire per token on average. Existing work compares SAE algorithms using sparsity-reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value aside from its effect on reconstruction. In this work we study the effect of L0 on SAEs, and show that if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we present a proxy metric that can help guide the search for the correct L0 for an SAE on a given training distribution. We show that our method finds the correct L0 in toy models and coincides with peak sparse probing performance in LLM SAEs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that L0 must be set correctly to train SAEs with correct features.

It is theorized that Large Language Models (LLMs) represent concepts as linear directions in representation space, known as the Linear Representation Hypothesis (LRH) (Elhage et al., 2022; Park et al., 2024). These concepts are nearly orthogonal linear directions, allowing the LLM to represent many more concepts than there are neurons, a phenomenon known as superposition (Elhage et al., 2022). However, superposition poses a challenge for interpretability, as neurons in the LLM are polysemantic, firing on many different concepts. Sparse autoencoders (SAEs) are meant to reverse superposition and extract interpretable, monosemantic latent features (Cunningham et al., 2024; Bricken et al., 2023) using sparse dictionary learning (Olshausen & Field, 1997).
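To make the definition of L0 concrete, the following is a minimal NumPy sketch of how the average number of active SAE latents per token could be measured. It is an illustration under stated assumptions, not the authors' code: the TopK-style encoder, the names `topk_sae_encode` and `mean_l0`, the random weights, and the sizes (64-dimensional activations, 512 latents, k = 20) are all hypothetical choices made for the example. It does not implement the proxy metric mentioned above, which this excerpt does not describe.

```python
import numpy as np


def topk_sae_encode(x, W_enc, b_enc, k):
    """Hypothetical TopK-style SAE encoder: ReLU, then keep the k largest latents per token."""
    pre = x @ W_enc + b_enc                  # shape: (n_tokens, n_latents)
    acts = np.maximum(pre, 0.0)              # ReLU keeps latents non-negative
    # Indices of the k largest activations in each row (order within the k is arbitrary).
    top_idx = np.argpartition(acts, -k, axis=1)[:, -k:]
    mask = np.zeros(acts.shape, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=1)
    return np.where(mask, acts, 0.0)         # everything outside the top k is zeroed


def mean_l0(latents):
    """L0 as described in the text: average count of firing (non-zero) latents per token."""
    return float((latents > 0).sum(axis=1).mean())


# Stand-in data: random vectors in place of real LLM activations (an assumption).
rng = np.random.default_rng(0)
n_tokens, d_model, n_latents, k = 1000, 64, 512, 20
x = rng.normal(size=(n_tokens, d_model))
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
b_enc = np.zeros(n_latents)

latents = topk_sae_encode(x, W_enc, b_enc, k)
l0 = mean_l0(latents)
print(f"mean L0 per token: {l0:.2f} (upper bound k = {k})")
```

In an encoder of this kind, k directly caps L0, so sweeping k is one simple way to explore the too-low and too-high regimes the abstract describes. For SAEs that control sparsity through a penalty term instead, the same `mean_l0` measurement applies to whatever latents the trained encoder produces.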

large language model, machine learning, natural language
