Steering Language Model Refusal with Sparse Autoencoders

Open in new window