Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models
