Understanding Refusal in Language Models with Sparse Autoencoders
