Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
