Which Sparse Autoencoder Features Are Real? Model-X Knockoffs for False Discovery Rate Control

Enkhbayar, Tsogt-Ochir

arXiv.org Artificial Intelligence

In artificial intelligence research, understanding the internal representations of a large language model remains a fundamental challenge [Olah et al., 2020]. Sparse autoencoders (SAEs) can now decompose neural network activations into interpretable features [Cunningham et al., 2023, Templeton et al., 2024]. By learning overcomplete sparse representations of model activations, SAEs aim to decompose polysemantic neurons into monosemantic features that correspond to human-comprehensible concepts. Finding SAE features and confirming their legitimacy, however, are not the same thing. Most interpretability research today relies on manual inspection, automated explanation scoring, or correlation with downstream tasks. These methods lack formal statistical guarantees and cannot distinguish genuine computational patterns from spurious correlations arising from the multiple testing problem: when thousands of candidate features are examined, random chance alone will produce a large number of apparent correlations with any target variable.
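As an illustration of the decomposition step described above, the following is a minimal sketch of a sparse autoencoder in PyTorch: a linear encoder into an overcomplete hidden dictionary, a ReLU nonlinearity, a linear decoder, and a reconstruction-plus-L1 training objective. The class name, dimensions, and l1_coeff penalty weight are illustrative assumptions, not the architecture or hyperparameters used in the cited work.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal overcomplete sparse autoencoder over model activations (illustrative)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # d_hidden >> d_model gives an overcomplete dictionary of candidate features
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction of the original activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse feature use
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

The multiple testing problem described above can be made concrete with a small simulation: even when feature activations and a target variable are independent by construction, a naive per-feature significance threshold flags hundreds of "discoveries" once thousands of features are screened. The sample size, feature count, and threshold below are arbitrary choices for illustration only.

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1_000, 10_000

# Pure noise: features and target are independent, so every "hit" is spurious
activations = rng.normal(size=(n_samples, n_features))
target = rng.normal(size=n_samples)

# Pearson correlation of each feature with the target
centered = activations - activations.mean(axis=0)
corrs = centered.T @ (target - target.mean()) / (
    n_samples * activations.std(axis=0) * target.std()
)

# Naive two-sided 5% threshold for a single test, roughly 1.96 / sqrt(n)
threshold = 1.96 / np.sqrt(n_samples)
print("spurious discoveries:", int((np.abs(corrs) > threshold).sum()))  # ~500 expected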