Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models

Neural Information Processing Systems 

Capability evaluations play a crucial role in assessing and regulating frontier AI systems. Their effectiveness is threatened by strategic underperformance, or ``sandbagging'', in which a model deliberately performs below its true capability during evaluation.