Applying sparse autoencoders to unlearn knowledge in language models

Open in new window