Why Do Some Language Models Fake Alignment While Others Don't?
