Goto

Collaborating Authors

 Large Language Model


Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning

arXiv.org Machine Learning

Reinforcement learning for large language models (LLMs) faces a fundamental tension: high-throughput inference engines and numerically-precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on log-probability mismatch scales as $(1-p)$ where $p$ is the token probability. For high-probability tokens, this bound vanishes, contributing negligibly to sequence-level mismatch. For low-probability tokens in the tail, the bound remains large, and moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically-pruned ``safe'' vocabulary that excludes the extreme tail. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary pruning.


Trust Region Masking for Long-Horizon LLM Reinforcement Learning

arXiv.org Machine Learning

Policy gradient methods for large language models optimize a surrogate objective computed from samples of a rollout policy $ฯ€_{\text{roll}}$. When $ฯ€_{\text{roll}} \ne ฯ€_ฮธ$, there is approximation error between the surrogate and the true objective. Prior work has shown that this off-policy mismatch is unavoidable in modern LLM-RL due to implementation divergence, mixture-of-experts routing discontinuities, and distributed training staleness. Classical trust region bounds on the resulting error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. We derive two tighter bounds: a Pinsker-Marginal bound scaling as $O(T^{3/2})$ and a Mixed bound scaling as $O(T)$. Crucially, both bounds depend on $D_{kl}^{tok,max}$ -- the maximum token-level KL divergence across all positions in a sequence. This is inherently a sequence-level quantity: it requires examining the entire trajectory to compute, and therefore cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which excludes entire sequences from gradient computation if any token violates the trust region, providing the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.


Theory and Algorithms for Learning with Multi-Class Abstention and Multi-Expert Deferral

arXiv.org Machine Learning

Large language models (LLMs) have achieved remarkable performance but face critical challenges: hallucinations and high inference costs. Leveraging multiple experts offers a solution: deferring uncertain inputs to more capable experts improves reliability, while routing simpler queries to smaller, distilled models enhances efficiency. This motivates the problem of learning with multiple-expert deferral. This thesis presents a comprehensive study of this problem and the related problem of learning with abstention, supported by strong consistency guarantees. First, for learning with abstention (a special case of deferral), we analyze score-based and predictor-rejector formulations in multi-class classification. We introduce new families of surrogate losses and prove strong non-asymptotic, hypothesis set-specific consistency guarantees, resolving two existing open questions. We analyze both single-stage and practical two-stage settings, with experiments on CIFAR-10, CIFAR-100, and SVHN demonstrating the superior performance of our algorithms. Second, we address general multi-expert deferral in classification. We design new surrogate losses for both single-stage and two-stage scenarios and prove they benefit from strong $H$-consistency bounds. For the two-stage scenario, we show that our surrogate losses are realizable $H$-consistent for constant cost functions, leading to effective new algorithms. Finally, we introduce a novel framework for regression with deferral to address continuous label spaces. Our versatile framework accommodates multiple experts and various cost structures, supporting both single-stage and two-stage methods. It subsumes recent work on regression with abstention. We propose new surrogate losses with proven $H$-consistency and demonstrate the empirical effectiveness of the resulting algorithms.


Understanding the Mechanisms of Fast Hyperparameter Transfer

arXiv.org Machine Learning

The growing scale of deep learning models has rendered standard hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware hyperparameters, which can enable direct transfer of optimal HPs from small-scale grid searches to large models with minimal performance loss. To understand the principles governing such transfer strategy, we develop a general conceptual framework for reasoning about HP transfer across scale, characterizing transfer as fast when the suboptimality it induces vanishes asymptotically faster than the finite-scale performance gap. We show formally that fast transfer is equivalent to useful transfer for compute-optimal grid search, meaning that transfer is asymptotically more compute-efficient than direct tuning. While empirical work has found that the Maximal Update Parameterization ($ฮผ$P) exhibits fast transfer when scaling model width, the mechanisms remain poorly understood. We show that this property depends critically on problem structure by presenting synthetic settings where transfer either offers provable computational advantage or fails to outperform direct tuning even under $ฮผ$P. To explain the fast transfer observed in practice, we conjecture that decomposing the optimization trajectory reveals two contributions to loss reduction: (1) a width-stable component that determines the optimal HPs, and (2) a width-sensitive component that improves with width but weakly perturbs the HP optimum. We present empirical evidence for this hypothesis across various settings, including large language model pretraining.


Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

arXiv.org Machine Learning

Hyperparameter tuning can dramatically impact training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as $ฮผ$P, have enabled transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice of search for optimal global base hyperparameters at a small model size, and transfer to a large size. We extend these works in two key ways. To handle scaling along most important scaling axes, we propose the Complete$^{(d)}$ Parameterisation that unifies scaling in width and depth -- using an adaptation of CompleteP -- as well as in batch-size and training duration. Secondly, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape, and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate significant training speed improvements in Large Language Models with the transferred per-module hyperparameters.


3 Common Misunderstandings About AI in 2025

TIME - Tech

Children and parked cars are color-coded on a monitor inside a Mercedes-Benz S-Class during an autonomous driving and AI demonstration in Immendingen, Germany on July 17, 2018. Children and parked cars are color-coded on a monitor inside a Mercedes-Benz S-Class during an autonomous driving and AI demonstration in Immendingen, Germany on July 17, 2018. In 2025, misconceptions about AI flourished as people struggled to make sense of the rapid development and adoption of the technology. Here are three popular ones to leave behind in the New Year. When GPT-5 was released in May, people wondered (not for the first time) if AI was hitting a wall.


'This will be a stressful job': Sam Altman offers 555k salary to fill most daunting role in AI

The Guardian

'You'll jump into the deep end pretty much immediately,' Altman said while announcing the vacancy. 'You'll jump into the deep end pretty much immediately,' Altman said while announcing the vacancy. 'This will be a stressful job': Sam Altman offers $555k salary to fill most daunting role in AI New head of preparedness at OpenAI will face unnerving in-tray amid fears from some experts that AI could'turn on us' Mon 29 Dec 2025 09.44 ESTLast modified on Mon 29 Dec 2025 10.10 EST The maker of ChatGPT has advertised a $555,000-a-year vacancy with a daunting job description that would cause Superman to take a sharp intake of breath. In what may be close to the impossible job, the "head of preparedness" at OpenAI will be directly responsible for defending against risks from ever more powerful AIs to human mental health, cybersecurity and biological weapons. That is before the successful candidate has to start worrying about the possibility that AIs may soon begin training themselves amid fears from some experts they could "turn against us". "This will be a stressful job, and you'll jump into the deep end pretty much immediately," said Sam Altman, the chief executive of the San Francisco-based organisation, as he launched the hunt to fill "a critical role" to "help the world".


3 New Tricks to Try With Google Gemini Live After Its Latest Major Upgrade

WIRED

Google's AI is now even smarter, and more versatile. Gemini Live is the more conversational, natural language way of interacting with the Google Gemini AI bot using your voice. The idea is you chat with it like you would chat with a friend, interruptions and all, even if the actual answers are the same as you'd get from typing your queries into Gemini as normal. Now, about a year and a half after its debut, Gemini Live has been given what Google is describing as its "biggest update ever." The update makes the Gemini Live mode even more natural and even more conversational than before, with a better understanding of tone, nuance, pronunciation, and rhythm.


A Unified Definition of Hallucination, Or: It's the World Model, Stupid

arXiv.org Machine Learning

Despite numerous attempts to solve the issue of hallucination since the inception of neural language models, it remains a problem in even frontier large language models today. Why is this the case? We walk through definitions of hallucination used in the literature from a historical perspective up to the current day, and fold them into a single definition of hallucination, wherein different prior definitions focus on different aspects of our definition. At its core, we argue that hallucination is simply inaccurate (internal) world modeling, in a form where it is observable to the user (e.g., stating a fact which contradicts a knowledge base, or producing a summary which contradicts a known source). By varying the reference world model as well as the knowledge conflict policy (e.g., knowledge base vs. in-context), we arrive at the different existing definitions of hallucination present in the literature. We argue that this unified view is useful because it forces evaluations to make clear their assumed "world" or source of truth, clarifies what should and should not be called hallucination (as opposed to planning or reward/incentive-related errors), and provides a common language to compare benchmarks and mitigation techniques. Building on this definition, we outline plans for a family of benchmarks in which hallucinations are defined as mismatches with synthetic but fully specified world models in different environments, and sketch out how these benchmarks can use such settings to stress-test and improve the world modeling components of language models.


5 Best apps to use on ChatGPT right now

FOX News

This material may not be published, broadcast, rewritten, or redistributed. Quotes displayed in real-time or delayed by at least 15 minutes. Market data provided by Factset . Powered and implemented by FactSet Digital Solutions . Mutual Fund and ETF data provided by Refinitiv Lipper .