Detecting Underperformance: Noise Injection Increases the Accuracy of Sandbagging LLMs

Neural Information Processing Systems 

Capability evaluations play a crucial role in assessing and regulating frontier AI systems. Their effectiveness, however, is threatened by strategic underperformance, or "sandbagging", in which a model deliberately performs below its true capability during evaluation.
