What's next for Chinese open-source AI

MIT Technology Review

Chinese open models are spreading fast, from Hugging Face to Silicon Valley. In this photo illustration, the DeepSeek app is seen on a phone in front of a flag of China on January 28, 2025 in Hong Kong, China. The past year has marked a turning point for Chinese AI. Since DeepSeek released its R1 reasoning model in January 2025, Chinese companies have repeatedly delivered AI models that match the performance of leading Western models at a fraction of the cost. Just last week, the Chinese firm Moonshot AI released its latest open-weight model, Kimi K2.5, which came close to top proprietary systems such as Anthropic's Claude Opus on some early benchmarks. The difference: K2.5 costs roughly one-seventh as much as Opus.


What's next for AI in 2026

MIT Technology Review

Our AI writers make their big bets for the coming year--here are five hot trends to watch. In an industry in constant flux, sticking your neck out to predict what's coming next may seem reckless. But for the last few years we've done just that--and we're doing it again. How did we do last time? Here are our big bets for the next 12 months. The last year shaped up as a big one for Chinese open-source models.


AI Wrapped: The 14 AI terms you couldn't avoid in 2025

MIT Technology Review

From "superintelligence" to "slop," here are the words and phrases that defined another year of AI craziness. If the past 12 months have taught us anything, it's that the AI hype train is showing no signs of slowing. It's hard to believe that at the beginning of the year, DeepSeek had yet to turn the entire industry on its head, Meta was better known for trying (and failing) to make the metaverse cool than for its relentless quest to dominate superintelligence, and vibe coding wasn't a thing. If that's left you feeling a little confused, fear not. As we near the end of 2025, our writers have taken a look back over the AI terms that dominated the year, for better or worse. And make sure to brace yourself for what promises to be another bonkers year.


Five AI Developments That Changed Everything This Year

TIME - Tech

President Donald Trump speaks in the Roosevelt Room flanked by Masayoshi Son, Larry Ellison, and Sam Altman at the White House on January 21, 2025. In case you missed it, 2025 was a big year for AI. It became an economic force, propping up the stock market, and a geopolitical pawn, redrawing the frontlines of Great Power competition. It had both global and deeply personal effects, changing the ways that we think, write, and relate.


OpenAI Rolls Back ChatGPT's Model Router System for Most Users

WIRED

As OpenAI scrambles to improve ChatGPT, it's ditching a feature in its free tier that contributed to last summer's user revolt. OpenAI has quietly reversed a major change to how hundreds of millions of people use ChatGPT. On a low-profile blog that tracks product changes, the company said that it rolled back ChatGPT's model router--an automated system that sends complicated user questions to more advanced "reasoning" models--for users on its Free and $5-a-month Go tiers. Instead, those users will now default to GPT-5.2 Instant, the fastest and cheapest-to-serve version of OpenAI's new model series. Free and Go users will still be able to access reasoning models, but they will have to select them manually.


The great AI hype correction of 2025

MIT Technology Review

Four ways to think about this year's reckoning When OpenAI released a free web app called ChatGPT in late 2022, it changed the course of an entire industry--and several world economies. Millions of people started talking to their computers, and their computers started talking back. We were enchanted, and we expected more. Technology companies scrambled to stay ahead, putting out rival products that outdid one another with each new release: voice, images, video. With nonstop one-upmanship, AI companies have presented each new product drop as a major breakthrough, reinforcing a widespread faith that this technology would just keep getting better. Boosters told us that progress was exponential.


Benchmarking World-Model Learning

Warrier, Archana, Nguyen, Dat, Naim, Michelangelo, Jain, Moksh, Liang, Yichao, Schroeder, Karen, Yang, Cambridge, Tenenbaum, Joshua B., Vollmer, Sebastian, Ellis, Kevin, Tavares, Zenna

arXiv.org Artificial Intelligence

Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended--models should support many different tasks unknown ahead of time--and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and that scaling compute improves performance only in some environments. WorldTest provides a novel template--reward-free exploration, derived tests, and behavior-based scoring--to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.
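The two-phase protocol the abstract describes can be sketched in code. Everything below--the class names, the toy grid environment, the placeholder agent, and the zero-scoring tasks--is an illustrative assumption, not the AutumnBench implementation:

```python
import random

class ToyGridEnv:
    """Minimal stand-in for an AutumnBench-style grid environment
    (the real environments are interactive grid worlds with causal dynamics)."""
    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos = (self.pos + action) % self.size
        return self.pos

class RandomAgent:
    """Placeholder agent: a real entrant would build a world model
    from its reward-free interactions."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def explore(self, obs):
        return self.rng.choice([-1, 1])

class Task:
    """A derived test, unknown to the agent during exploration."""
    def __init__(self, name, score_fn):
        self.name, self.score_fn = name, score_fn

    def score(self, agent, env):
        return self.score_fn(agent, env)

def world_test(agent, explore_env, test_env, tasks, explore_steps=100):
    # Phase 1: reward-free interaction -- no task or reward is revealed,
    # so the agent must gather information for tasks it cannot anticipate.
    obs = explore_env.reset()
    for _ in range(explore_steps):
        obs = explore_env.step(agent.explore(obs))
    # Phase 2: scored test phase in a different but related environment;
    # scoring is behavior-based, so any model representation can compete.
    return {t.name: t.score(agent, test_env) for t in tasks}

scores = world_test(
    RandomAgent(),
    ToyGridEnv(size=5),
    ToyGridEnv(size=7),  # related but different environment
    [Task("masked_frame_prediction", lambda a, e: 0.0),
     Task("planning", lambda a, e: 0.0)],
)
```

The key structural points are that the exploration phase exposes no rewards and the scored tasks only appear afterward, which is what distinguishes the protocol from next-frame-prediction training and same-environment reward maximization.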


LightSearcher: Efficient DeepSearch via Experiential Memory

Lan, Hengzhi, Yu, Yue, Qian, Li, Peng, Li, Wu, Jie, Liu, Wei, Luan, Jian, Bai, Ting

arXiv.org Artificial Intelligence

DeepSearch paradigms have become a core enabler for deep reasoning models, allowing them to invoke external search tools to access up-to-date, domain-specific knowledge beyond parametric boundaries, thereby enhancing the depth and factual reliability of reasoning. Building upon this foundation, recent advances in reinforcement learning (RL) have further empowered models to autonomously and strategically control search tool usage, optimizing when and how to query external knowledge sources. Yet these RL-driven DeepSearch systems often reveal a see-saw trade-off between accuracy and efficiency: frequent tool invocations can improve factual correctness but lead to unnecessary computational overhead and diminished efficiency. To address this challenge, we propose LightSearcher, an efficient RL framework that incorporates textual experiential memory by learning contrastive reasoning trajectories to generate interpretable summaries of successful reasoning patterns. In addition, it employs an adaptive reward shaping mechanism that penalizes redundant tool calls only in correct-answer scenarios. This design effectively balances the inherent accuracy-efficiency trade-off in DeepSearch paradigms. Experiments on four multi-hop QA benchmarks show that LightSearcher maintains accuracy comparable to the SOTA baseline ReSearch, while reducing search tool invocations by 39.6%, inference time by 48.6%, and token consumption by 21.2%, demonstrating its superior efficiency.
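The adaptive reward-shaping idea--penalizing redundant tool calls only when the answer is correct--can be sketched as follows. The function name, base reward, and penalty coefficient are illustrative assumptions, not LightSearcher's actual reward:

```python
def shaped_reward(is_correct: bool, num_tool_calls: int,
                  base_reward: float = 1.0, penalty: float = 0.05) -> float:
    """Hypothetical sketch of adaptive reward shaping in the LightSearcher style.

    Tool-call penalties apply only in correct-answer scenarios, so the agent
    is never discouraged from searching while it still needs information to
    get the answer right.
    """
    if not is_correct:
        return 0.0  # wrong answers earn no reward and no efficiency penalty
    # Correct answers: reward minus a small cost per search-tool invocation.
    return max(0.0, base_reward - penalty * num_tool_calls)

# A correct answer reached with fewer tool calls earns a higher shaped reward:
assert shaped_reward(True, 2) > shaped_reward(True, 8)
```

Gating the penalty on correctness is what keeps the shaping from collapsing into a blanket search tax, which would push the accuracy side of the see-saw down.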


Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models

Guo, Dadi, Liu, Jiayu, Fan, Zhiyuan, He, Zhitao, Li, Haoran, Li, Yuxin, Wang, Yumeng, Fung, Yi R.

arXiv.org Artificial Intelligence

Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high accuracy these models report on popular datasets--inflated by purely numerical evaluation and potential benchmark leakage--often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, revealing fundamental limitations in current large reasoning models: 1) they grapple profoundly with mathematical proofs, with some generating entirely correct proofs for less than 20% of problems and failing even on basic ones; 2) they exhibit a diverse spectrum of reasoning failures, prominently demonstrating the lack of guarantees for the correctness and rigor of single-step reasoning; and 3) they show hallucination and incompleteness during the reasoning process. Our findings reveal that models' self-reflection is insufficient to resolve these logical dilemmas, necessitating formalized and fine-grained logical training.
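The abstract's core claim--that answer-only grading can hide flawed reasoning--can be illustrated with a toy comparison. Both graders and the "validity oracle" below are purely hypothetical, not the RFMDataset evaluation pipeline:

```python
def numeric_grade(model_answer: str, gold: str) -> bool:
    # Answer-only grading: an internally flawed derivation still scores
    # full marks as long as the final number matches the gold answer.
    return model_answer.strip() == gold.strip()

def proof_grade(steps, step_is_valid) -> bool:
    # Proof-style grading: every intermediate step must be justified,
    # so single-step errors and hallucinated claims are exposed.
    return all(step_is_valid(s) for s in steps)

# A derivation with an invalid middle step but the right final answer:
steps = ["x^2 = 4", "therefore x = 2 (ignores x = -2)", "answer: 2"]
is_valid = lambda s: "ignores" not in s  # toy stand-in for a validity check

numeric_result = numeric_grade("2", "2")   # answer-only grading passes
proof_result = proof_grade(steps, is_valid)  # proof grading catches the gap
```

This is why the paper treats proofs as a litmus test: requiring step-level rigor turns a scalar accuracy number into a fine-grained diagnostic.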