The Download: AI benchmarks, and Spain's grid blackout

MIT Technology Review 

SWE-Bench (pronounced "swee bench") launched in November 2023 as a way to evaluate an AI model's coding skill, and it has quickly become one of the most popular tests in AI. A SWE-Bench score is now a mainstay of major model releases from OpenAI, Anthropic, and Google, and beyond the foundation models, fine-tuners at AI firms compete constantly to see who can rise above the pack. Despite all the fervor, a high SWE-Bench score isn't exactly a truthful assessment of which model is "better." Entrants have begun to game the system, which is pushing many in the field to wonder whether there's a better way to actually measure AI achievement.