Large Language Model
'Architects of AI' named Time Magazine's Person of the Year
Time Magazine's Person of the Year for 2025 is not a single person. Instead, the magazine has recognised the year's most influential figure as the architects of artificial intelligence (AI). Nvidia boss Jensen Huang, Meta head Mark Zuckerberg, X owner Elon Musk and AI godmother Fei-Fei Li are among those depicted on one of the magazine's two covers. Experts say it highlights how quickly AI, and the firms behind it, are reshaping society. It comes as a boom in the technology, ushered in by OpenAI's launch of ChatGPT in late 2022, continues at pace.
The Download: solar geoengineering's future, and OpenAI is being sued
Solar geoengineering aims to manipulate the climate by bouncing sunlight back into space. In theory, it could ease global warming. But as interest in the idea grows, so do concerns about potential consequences. A startup called Stardust Solutions recently raised a $60 million funding round, the largest known to date for a geoengineering startup. My colleague James Temple has a new story out about the company, and how its emergence is making some researchers nervous. So far, the field has been limited to debates, proposed academic research, and--sure--a few fringe actors to keep an eye on.
Google's Gemini AI comes to Chrome on iPhone and iPad
It can summarize pages, create a FAQ on a topic and modify recipes for your dietary needs. After rolling it out on desktop and Android earlier in 2025, Google is finally bringing its built-in Gemini AI experience to iPhone and iPad. It offers new features like summarizing pages and helping you test your knowledge about a subject you're learning. As with any AI tool, though, it shouldn't be trusted for anything important given the possibility of hallucinations and other errors. When it arrives on your iOS device, tapping the spark icon at the left of the address bar (in place of the Google Lens camera) brings up a Pages tool that offers Lens and the new feature, Ask Gemini.
The Story Behind TIME's 2025 Person of the Year Covers
Pine is the Creative Director at TIME. To illustrate the choice of the Architects of AI as TIME's 2025 Person of the Year, we asked two separate artists to help us visualize the incredibly complex technological revolution that is currently underway. London-based illustrator and graphics animator Peter Crowther and digital painter Jason Seiler each created an image that speaks to the duality AI has produced - man vs. machine. Inspired by the inner workings of computer chips, Crowther's intricate AI structure looms large over the busy construction site.
A Minimalist Optimizer Design for LLM Pretraining
Glentis, Athanasios, Li, Jiaxiang, Han, Andi, Hong, Mingyi
Training large language models (LLMs) typically relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory than SGD to maintain first- and second-order moments. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed variants to reduce memory consumption, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly (memory- and compute-) efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques leads to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory-efficient pretraining. Across multiple LLaMA models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For the LLaMA 7B model, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon in terms of both perplexity and memory consumption.
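The two ingredients named in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation. It assumes each weight gradient is stored as a list of rows with shape (d_out x d_in), so that a column's norm is taken over the output dimension, and it assumes momentum is applied before normalization on the output layer. The function names, `lr`, `beta` and `eps` are all hypothetical choices made for illustration.

```python
import math


def column_normalize(grad, eps=1e-8):
    """Divide each column of a (d_out x d_in) gradient by its L2 norm.

    Assumption: rows index the output dimension, so each column norm is
    computed over the output dimension.
    """
    n_rows, n_cols = len(grad), len(grad[0])
    norms = [
        math.sqrt(sum(grad[r][c] ** 2 for r in range(n_rows))) + eps
        for c in range(n_cols)
    ]
    return [[grad[r][c] / norms[c] for c in range(n_cols)] for r in range(n_rows)]


def scale_like_step(params, grads, momentum, last_layer, lr=0.1, beta=0.9):
    """One update: column-normalized SGD, momentum kept only for `last_layer`."""
    for name, grad in grads.items():
        if name == last_layer:
            # The only optimizer state: a first-order momentum buffer
            # for the output layer (assumed to be averaged before normalizing).
            buf = momentum.setdefault(name, [[0.0] * len(grad[0]) for _ in grad])
            for r, row in enumerate(grad):
                for c, g in enumerate(row):
                    buf[r][c] = beta * buf[r][c] + (1.0 - beta) * g
            grad = buf
        update = column_normalize(grad)
        for r, row in enumerate(update):
            for c, u in enumerate(row):
                params[name][r][c] -= lr * u
    return params, momentum


# Toy usage with two hypothetical layers named "hidden" and "out".
params = {"hidden": [[1.0, 1.0], [1.0, 1.0]], "out": [[0.0, 0.0]]}
grads = {"hidden": [[3.0, 0.0], [4.0, 2.0]], "out": [[5.0, -2.0]]}
params, state = scale_like_step(params, grads, {}, last_layer="out")
print(sorted(state))  # only the output layer carries optimizer state
```

In this sketch the memory saving comes from the state dictionary: every layer except the last is updated from its current gradient alone, so no per-parameter buffers are stored for it.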
Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search
Fadeeva, Ekaterina, Goloburda, Maiya, Rubashevskii, Aleksandr, Vashurin, Roman, Shelmanov, Artem, Nakov, Preslav, Sachan, Mrinmaya, Panov, Maxim
Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
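To make the idea of scoring agreement over beam candidates concrete, here is a minimal sketch. It is not one of the paper's estimators. It assumes each beam candidate comes with a sequence log-probability, renormalizes those probabilities over the beam set, and reports one minus the probability-weighted pairwise agreement. The function names and the exact-match similarity are hypothetical simplifications.

```python
import math


def normalized_weights(log_probs):
    """Turn candidate sequence log-probabilities into weights that sum to 1."""
    top = max(log_probs)  # subtract the max for numerical stability
    exps = [math.exp(lp - top) for lp in log_probs]
    total = sum(exps)
    return [e / total for e in exps]


def exact_match(a, b):
    """Toy agreement function: 1.0 when the normalized strings match."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def weighted_disagreement(candidates, log_probs, similarity=exact_match):
    """Uncertainty as 1 minus the probability-weighted pairwise agreement."""
    w = normalized_weights(log_probs)
    agreement = sum(
        w[i] * w[j] * similarity(candidates[i], candidates[j])
        for i in range(len(candidates))
        for j in range(len(candidates))
    )
    return 1.0 - agreement


# Toy usage: three distinct beam outputs for a short-form question.
beams = ["Paris", "paris", "Lyon"]
beam_log_probs = [math.log(0.6), math.log(0.3), math.log(0.1)]
print(weighted_disagreement(beams, beam_log_probs))
```

Because beam search returns distinct sequences deterministically, the score in this sketch does not change between runs, whereas a sampled candidate set would vary with the random seed.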
Impact of Positional Encoding: Clean and Adversarial Rademacher Complexity for Transformers under In-Context Regression
Positional encoding (PE) is a core architectural component of Transformers, yet its impact on the Transformer's generalization and robustness remains unclear. In this work, we provide the first generalization analysis for a single-layer Transformer under in-context regression that explicitly accounts for a completely trainable PE module. Our result shows that PE systematically enlarges the generalization gap. Extending to the adversarial setting, we derive the adversarial Rademacher generalization bound. We find that the gap between models with and without PE is magnified under attack, demonstrating that PE amplifies the vulnerability of models. Our bounds are empirically validated by a simulation study. Together, this work establishes a new framework for understanding the clean and adversarial generalization in ICL with PE.
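For readers unfamiliar with the quantities involved, the standard textbook definitions are reproduced below. They are not the paper's bounds. The symbols are assumptions introduced here: a sample $S = \{z_i\}_{i=1}^{n}$ with $z_i = (x_i, y_i)$, a hypothesis class $\mathcal{F}$, a loss $\ell$, i.i.d. Rademacher signs $\sigma_i \in \{-1, +1\}$, and a perturbation budget $\epsilon$ under some norm.

```latex
% Empirical Rademacher complexity of a function class F on the sample S
\hat{\mathfrak{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\!\left[ \sup_{f \in \mathcal{F}}
      \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, f(z_i) \right]

% Adversarial loss: worst case over perturbations within the budget epsilon
\tilde{\ell}\bigl(f, (x, y)\bigr)
  = \max_{\|x' - x\| \le \epsilon} \ell\bigl(f(x'), y\bigr)

% Adversarial Rademacher complexity: the same quantity for the adversarial loss class
\hat{\mathfrak{R}}_S(\tilde{\ell} \circ \mathcal{F})
  = \mathbb{E}_{\sigma}\!\left[ \sup_{f \in \mathcal{F}}
      \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, \tilde{\ell}\bigl(f, (x_i, y_i)\bigr) \right]
```

Generalization bounds of this type control the gap between expected and empirical loss by a constant multiple of the relevant complexity plus a concentration term, so a larger complexity for models with PE corresponds to a looser guarantee.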
Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws
Dataset distillation (DD) aims to construct compact synthetic datasets that allow models to achieve comparable performance to full-data training while substantially reducing storage and computation. Despite rapid empirical progress, its theoretical foundations remain limited: existing methods (gradient, distribution, trajectory matching) are built on heterogeneous surrogate objectives and optimization assumptions, which makes it difficult to analyze their common principles or provide general guarantees. Moreover, it is still unclear under what conditions distilled data can retain the effectiveness of full datasets when the training configuration, such as optimizer, architecture, or augmentation, changes. To answer these questions, we propose a unified theoretical framework, termed configuration-dynamics-error analysis, which reformulates major DD approaches under a common generalization-error perspective and provides two main results: (i) a scaling law that provides a single-configuration upper bound, characterizing how the error decreases as the distilled sample size increases and explaining the commonly observed performance saturation effect; and (ii) a coverage law showing that the required distilled sample size scales linearly with configuration diversity, with provably matching upper and lower bounds. In addition, our unified analysis reveals that various matching methods are interchangeable surrogates, reducing the same generalization error, clarifying why they can all achieve dataset distillation and providing guidance on how surrogate choices affect sample efficiency and robustness. Experiments across diverse methods and configurations empirically confirm the derived laws, advancing a theoretical foundation for DD and enabling theory-driven design of compact, configuration-robust dataset distillation.
Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models
Ye, Yifan, Ma, Jiaqi, Cen, Jun, Lu, Zhihe
Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment, as inference becomes computationally expensive and latency-sensitive in dynamic environments. To address this, we propose Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial vicinity of attention-highlighted regions, enhancing contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, effectively reducing redundancy while maintaining semantic coherence. By coupling expansion and merging within a single feed-forward pass, TEAM-VLA achieves a balanced trade-off between efficiency and effectiveness, without any retraining or parameter updates. Extensive experiments on the LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while maintaining or even surpassing the task success rate of full VLA models. The code is publicly available at https://github.com/Jasper-aaa/TEAM-VLA.
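The expand-then-merge pattern described above can be sketched in a few lines of pure Python. This is a loose illustration under stated assumptions, not the TEAM-VLA code from the linked repository. It assumes patch tokens lie on a row-major height x width grid, treats "spatial vicinity" as the four direct neighbours of each top-scoring token, and stands in for action-aware guidance with a generic per-token score. All function and parameter names are hypothetical.

```python
def expand_selection(scores, height, width, top_k):
    """Pick the top_k highest-scoring patch tokens, then add their 4-neighbours."""
    ranked = sorted(range(height * width), key=lambda i: scores[i], reverse=True)
    seeds = ranked[:top_k]
    selected = set(seeds)
    for idx in seeds:
        r, c = divmod(idx, width)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < height and 0 <= nc < width:
                selected.add(nr * width + nc)
    return sorted(selected)


def merge_tokens(tokens, guidance, n_keep):
    """Keep the n_keep best-guided tokens; average every other token
    into the kept token it is most similar to (by dot product)."""
    order = sorted(range(len(tokens)), key=lambda i: guidance[i], reverse=True)
    kept = sorted(order[:n_keep])
    sums = {k: list(tokens[k]) for k in kept}
    counts = {k: 1 for k in kept}
    for i in order[n_keep:]:
        best = max(kept, key=lambda k: sum(a * b for a, b in zip(tokens[i], tokens[k])))
        sums[best] = [s + x for s, x in zip(sums[best], tokens[i])]
        counts[best] += 1
    return [[s / counts[k] for s in sums[k]] for k in kept]


# Toy usage: a 3x3 grid whose centre patch receives the most attention.
attention = [0.1, 0.2, 0.1, 0.3, 0.9, 0.2, 0.0, 0.1, 0.05]
print(expand_selection(attention, height=3, width=3, top_k=1))
```

Both functions operate on tensors that an existing forward pass already produces, which is what allows this style of compression to run without retraining or parameter updates.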