Mechanistic Anomaly Detection for "Quirky" Language Models
