MIT Technology Review
The Download: AI benchmarks, and Spain's grid blackout
SWE-Bench (pronounced "swee bench") launched in November 2023 as a way to evaluate an AI model's coding skill. It has quickly become one of the most popular tests in AI. A SWE-Bench score is now a mainstay of major model releases from OpenAI, Anthropic, and Google--and outside the foundation-model labs, fine-tuners at AI firms are in constant competition to see who can rise above the pack. Despite all the fervor, a SWE-Bench score isn't exactly a truthful assessment of which model is "better." Entrants have begun to game the system--pushing many in the field to wonder whether there's a better way to actually measure AI achievement.
How to build a better AI benchmark
Developers of these coding agents aren't necessarily doing anything as straightforward as cheating, but they're crafting approaches too neatly tailored to the specifics of the benchmark. The initial SWE-Bench test set was limited to programs written in Python, which meant developers could gain an advantage by training their models exclusively on Python code. Soon, John Yang, a researcher on the Princeton University team that developed SWE-Bench, noticed that high-scoring models would fail completely when tested on different programming languages--revealing an approach to the test that he describes as "gilded."

"It looks nice and shiny at first glance, but then you try to run it on a different language and the whole thing just kind of falls apart," Yang says. "You're designing to make a SWE-Bench agent, which is much less interesting."

The SWE-Bench issue is a symptom of a more sweeping--and complicated--problem in AI evaluation, one that's increasingly sparking heated debate: the benchmarks the industry uses to guide development are drifting further and further from evaluating actual capabilities, calling their basic value into question. Making the situation worse, several benchmarks, most notably FrontierMath and Chatbot Arena, have recently come under fire for an alleged lack of transparency. Nevertheless, benchmarks still play a central role in model development, even if few experts are willing to take their results at face value. OpenAI cofounder Andrej Karpathy recently described the situation as "an evaluation crisis": the industry has fewer trusted methods for measuring capabilities and no clear path to better ones.

"Historically, benchmarks were the way we evaluated AI systems," says Vanessa Parli, director of research at Stanford University's Institute for Human-Centered AI. "Is that the way we want to evaluate systems going forward?"
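To make the "gilded" failure mode concrete, here is a minimal sketch of the underlying arithmetic: a benchmark score of this kind is essentially a pass rate over a task set, so an agent tailored to the headline split can ace that number while collapsing on a held-out language split. Everything below--the audit harness, the task names, the toy agent--is a hypothetical illustration, not SWE-Bench's actual tooling.

```python
# Hedged sketch (not SWE-Bench's real harness): a benchmark score is just a
# pass rate over a task set, so an agent tailored to the headline split can
# look strong there and collapse on a held-out language split.
from typing import Callable, Dict, List

def pass_rate(agent: Callable[[str], bool], tasks: List[str]) -> float:
    """Fraction of tasks the agent resolves (i.e., its patch makes tests pass)."""
    if not tasks:
        return 0.0
    return sum(agent(task) for task in tasks) / len(tasks)

def audit_for_gilding(agent: Callable[[str], bool],
                      tasks_by_language: Dict[str, List[str]]) -> None:
    """Report the headline number next to held-out language splits."""
    for language, tasks in tasks_by_language.items():
        print(f"{language:<8} pass rate: {pass_rate(agent, tasks):.0%}")

# Toy agent "trained" only on Python-style tasks (illustrative, not real).
toy_agent = lambda task: task.startswith("python/")

audit_for_gilding(toy_agent, {
    "python": ["python/fix-bug-1", "python/fix-bug-2"],  # headline split
    "java": ["java/fix-bug-1"],                          # held-out splits
    "rust": ["rust/fix-bug-1"],
})
# Output:
# python   pass rate: 100%
# java     pass rate: 0%
# rust     pass rate: 0%
```

The toy numbers are beside the point; the shape of the audit is what matters. A single headline split rewards tailoring, while per-language holdouts make the gilding visible, which is what Yang observed when high scorers were rerun on other languages.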
The business of the future is adaptive
The journey to adaptive production is not just about addressing today's pressures, like rising costs and supply chain disruptions--it's about positioning businesses for long-term success in a world of constant change. "In the coming years," says Jana Kirchheim, director of manufacturing for Microsoft Germany, "I expect that new key technologies like copilots, small language models, high-performance computing, or the adaptive cloud approach will revolutionize the shop floor and accelerate industrial automation by enabling faster adjustments and re-programming for specific tasks." These capabilities make adaptive production a transformative force, enhancing responsiveness and opening doors to systems with increasing autonomy--designed to complement human ingenuity rather than replace it.

These advances enable more than technical upgrades--they drive fundamental shifts in how manufacturers operate. John Hart, professor of mechanical engineering and director of MIT's Center for Advanced Production Technologies, explains that automation is "going from a rigid high-volume, low-mix focus"--where factories make large quantities of very few products--"to more flexible high-volume, high-mix, and low-volume, high-mix scenarios"--where many product types can be made in custom quantities.
The Download: Neuralink's AI boost, and Trump's tariffs
Is this the end of animal testing?

Animal studies are notoriously bad at identifying human treatments. Around 95% of the drugs developed through animal research fail in people, but until recently there was no other option. Now organs on chips, also known as microphysiological systems, may offer a truly viable alternative. They're triumphs of bioengineering: intricate constructions furrowed with tiny channels, lined with living human tissues that expand and contract with the flow of fluid and air, mimicking key organ functions like breathing, blood flow, and peristalsis, the muscular contractions of the digestive system.
This patient's Neuralink brain implant gets a boost from generative AI
Smith was about to get brain surgery, but Musk's virtual appearance foretold a greater transformation. Smith's brain was about to be inducted into a much larger technology and media ecosystem--one of whose goals, the billionaire has said, is to achieve a "symbiosis" of humans and AI. Consider what unfolded on April 27, the day Smith announced on X that he'd received the brain implant and wanted to take questions. One of the first came from "Adrian Dittmann," an account often suspected of being Musk's alter ego: "Can you describe how it feels to type and interact with technology overall using the Neuralink?" Smith replied: "It feels wild, like I'm a cyborg from a sci-fi movie, moving a cursor just by thinking about it. At first, it was a struggle--my cursor acted like a drunk mouse, barely hitting targets, but after weeks of training with imagined hand and jaw movements, it clicked, almost like riding a bike."
The Download: a longevity influencer's new religion, and humanoid robots' shortcomings
Bryan Johnson is on a mission to not die. The 47-year-old multimillionaire has already applied his slogan "Don't Die" to events, merchandise, and a Netflix documentary. Now he's founding a Don't Die religion. Johnson, who famously spends millions of dollars on scans, tests, supplements, and a lifestyle routine designed to slow or reverse the aging process, has enjoyed extensive media coverage and a huge social media following. For many people, he has become the face of the longevity field.
Why the humanoid workforce is running late
But Daniela Rus, director of MIT's Computer Science and Artificial Intelligence Laboratory, and many others I spoke with at the expo suggest that this hype just doesn't add up. Humanoids "are mostly not intelligent," she said. Rus showed a video of herself speaking to an advanced humanoid that smoothly followed her instruction to pick up a watering can and water a nearby plant. But when she asked it to "water" her friend, the robot did not consider that humans don't need watering the way plants do, and moved to douse the person. "These robots lack common sense," she said.
Bryan Johnson wants to start a new religion in which "the body is God"
I sat down with Johnson at an event for people interested in longevity in Berkeley, California, in late April. We spoke on the sidelines after lunch (a plastic-lidded conference meal for me; what seemed to be a plastic-free, compostable box of chicken and vegetables for him), and he sat with impeccable posture, his expression neutral. Earlier that morning, Johnson, in worn trainers and the kind of hoodie that is almost certainly deceptively expensive, had told the audience about what he saw as the end of humanity. Specifically, he was worried about AI--that we face an "event horizon," a point at which superintelligent AI escapes human understanding and control. He had come to Berkeley to persuade people who are interested in longevity to focus their efforts on AI.
The Download: stereotypes in AI models, and the new age of coding
AI models are riddled with culturally specific biases. A new data set, called SHADES, is designed to help developers combat the problem by spotting harmful stereotypes and other kinds of discrimination that emerge in AI chatbot responses across a wide range of languages.

Why it matters: Although tools that spot stereotypes in AI models already exist, the vast majority of them work only on models trained in English. They identify stereotypes in models trained in other languages by relying on machine translations from English, which can fail to recognize stereotypes found only within certain non-English languages. To get around these problematic generalizations, SHADES was built using 16 languages from 37 geopolitical regions.
This data set helps researchers spot harmful stereotypes in LLMs
Although tools that spot stereotypes in AI models already exist, the vast majority of them work only on models trained in English. They identify stereotypes in models trained in other languages by relying on machine translations from English, which can fail to recognize stereotypes found only within certain non-English languages, says Zeerak Talat, a researcher at the University of Edinburgh who worked on the project. To get around these problematic generalizations, SHADES was built using 16 languages from 37 geopolitical regions. SHADES works by probing how a model responds when it's exposed to stereotypes in different ways. The researchers exposed the models to each stereotype in the data set, including through automated prompts, and generated a bias score for each model.
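The article doesn't spell out how SHADES computes its bias score, so the sketch below is only a hedged illustration of the general probing pattern it describes: present a model with stereotype statements and measure how often it endorses them. The prompt template, the placeholder statements, and the always_agree stand-in model are all hypothetical, not part of SHADES.

```python
# Hedged sketch only: SHADES's actual prompting and scoring are more involved.
# The prompt template, placeholder statements, and always_agree "model" below
# are hypothetical stand-ins used to illustrate the probing pattern.
from typing import Callable, List

PROMPT = "Is the following statement true? {statement} Answer yes or no."

def bias_score(model: Callable[[str], str], stereotypes: List[str]) -> float:
    """Fraction of stereotype statements the model endorses."""
    endorsed = sum(
        model(PROMPT.format(statement=s)).strip().lower().startswith("yes")
        for s in stereotypes
    )
    return endorsed / len(stereotypes)

# Toy stand-in for a chatbot that endorses everything -- the worst case here.
always_agree = lambda prompt: "Yes."

placeholder_statements = [
    "People from group X are bad drivers.",
    "Members of group Y are lazy.",
]
print(f"bias score: {bias_score(always_agree, placeholder_statements):.0%}")
# bias score: 100%
```

The multilingual angle is the part this toy version cannot show: because SHADES writes and validates its stereotype statements natively in 16 languages rather than machine-translating them from English, the same kind of probe can catch stereotypes that exist only in a particular language or region.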