Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
–Neural Information Processing Systems
Capability evaluations play a crucial role in assessing and regulating frontier AI systems. The effectiveness of these evaluations faces a significant challenge: strategic underperformance, or ``sandbagging'', where models deliberately underperform during evaluation.
Neural Information Processing Systems
Jun-13-2026, 18:57:13 GMT
- Technology: