Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
