SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
