Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
