SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
