AI Sandbagging: Language Models can Strategically Underperform on Evaluations
