Mechanistic Interpretability with SAEs: Probing Religion, Violence, and Geography in Large Language Models
