Large Language Model
Sampling Data with Chains of Forward-Backward Diffusion Steps
Kang, Hyunmo, Levi, Noam Itzhak, Wegner, Corinna Elena, Korchinski, Daniel J., Wyart, Matthieu
Sampling from learned high-dimensional distributions is a foundational computational problem. We introduce U-turn chains: Markov chains obtained by iterating short forward-backward steps of a diffusion model, in which each step proposes a move that remains on the learned data manifold and, paired with a Metropolis-Hastings correction, samples from energy-modified targets. For synthetic languages, we show that minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold; ergodicity is restored at larger U-turn magnitude. In the non-ergodic regime, low-level features relax faster than high-level ones, an ordering that inverts only at sufficiently large U-turn magnitude. We test these predictions on natural language and natural images. In both modalities, minimal U-turns relax slowly, especially for high-level features approximated by deep representations in CNNs or LLMs. The layer-ordering inversion appears only at large noise when mixing is efficient -- signatures consistent with strongly constrained, weakly mixing local dynamics. We discuss the implications of these results for sampling with diffusion models.
Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination
Agnimo, Yedidia, Korba, Anna, Blangero, Annabelle, Chesneau, Nicolas, Alahari, Karteek
Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment. In parallel, numerous uncertainty estimation (UE) methods have been proposed to quantify model confidence and are often implicitly treated as proxies for model failure. However, the relationship between uncertainty and hallucinations remains insufficiently characterized. We present a systematic empirical study of the association between uncertainty estimators and hallucinations in LLMs. Rather than assuming this association, we evaluate directly when and to what extent it holds. We consider a diverse set of uncertainty estimators, including information-theoretic, sampling-based, and reflexive estimators, and examine their behavior across hallucination settings. Our experiments cover both intrinsic hallucinations (violations of input faithfulness) and extrinsic hallucinations (unsupported claims relative to training data), using four complementary benchmarks, including RAGTruth and HalluLens. We find that the association is highly variable and often weak, depending on the hallucination type and the LLM under evaluation. These results challenge the use of uncertainty as a direct signal of hallucination and clarify when it provides actionable information.
BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning
Gong, Shijin, Xu, Erhan, Ye, Kai, Quinzan, Francesco, Livieri, Giulia, Shi, Chengchun
Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We introduce BASIS, a critic-free post-training algorithm designed to address this tradeoff. At each online training step, BASIS samples only one rollout per prompt, but leverages rich information across prompts in the entire batch to improve value function estimation. Our experiments demonstrate that BASIS reduces MSE in value function estimation by 69% compared to REINFORCE++, a representative single-rollout baseline, and achieves lower MSE with one rollout than group mean estimators with 8 rollouts. This improvement in value estimation translates to better policy optimization: using substantially less training time, BASIS achieves performance close to multi-rollout GRPO-type baselines and often outperforms single-rollout REINFORCE-type baselines.
From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models
Liang, Yuchen, Shroff, Ness, Liang, Yingbin
Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, but, especially for uniform-rate models, they often require many steps to generate a single sample. Existing acceleration methods either rely on training additional quantities or suffer from slow mixing. In this work, we propose a novel Gibbs-based corrector for discrete diffusion models, termed Gibbs-Accelerated Discrete Diffusion (GADD). GADD leverages the structure of the concrete score function to construct Gibbs posterior likelihoods directly, without requiring any additional training beyond standard score estimation. We show that GADD achieves an overall sampling complexity of $\mathcal{O}(\mathrm{polylog} (\varepsilon^{-1}))$, yielding the first such rate for diffusion-based samplers for uniform-rate discrete diffusion models. We also conduct numerical experiments demonstrating the practical advantages of GADD across synthetic data, zero-shot text sampling, and zero-shot conditional music generation. These results corroborate the theory and show that GADD consistently improves sample quality and wall-clock efficiency over standard baselines, including vanilla Euler methods and CTMC correctors. Beyond this, our theoretical analysis introduces a novel framework for analyzing predictor-corrector methods in discrete diffusion models, which may be of independent interest. Unlike existing approaches that rely on the Girsanov change-of-measure technique, our method is based on an induction argument that tracks error propagation across predictor iterations while accounting for inaccuracies in the corrector updates.
Sam Altman Says AI 'Jobs Apocalypse' He Once Predicted Probably Won't Happen. What Changed?
Sam Altman Says AI'Jobs Apocalypse' He Once Predicted Probably Won't Happen. OpenAI CEO Sam Altman speaks during the BlackRock Infrastructure Summit on March 11, 2026 in Washington, DC. OpenAI CEO Sam Altman speaks during the BlackRock Infrastructure Summit on March 11, 2026 in Washington, DC. Throughout his rise to becoming one of the most influential CEOs in artificial intelligence, OpenAI's Sam Altman made repeated bold assertions about the impact that the new technology would have on jobs. He has said that AI will "probably replace most of the jobs people do today," that entire job categories will be "totally, totally gone," and that those impacted by the dramatic shifts will "find all sorts of new things to do. Now, however, Altman appears to have changed his tune, saying he is "delighted to be wrong" about the impact AI would have on employment. I don't think we're going to have the kind of jobs apocalypse that some of the companies in our space advocate or talk about, he said during a virtual interview at a Commonwealth Bank of Australia (CBA) conference in Sydney on Tuesday. "I thought there would have been more impact on entry-level white-collar jobs being eliminated by now than has actually happened, Altman said.
Pope Leo made me rethink how I use AI
PCWorld examines Pope Leo XIV's encyclical on AI, which emphasizes that artificial intelligence reflects creator biases and lacks genuine empathy or real-world experience. The Pope calls AI a "valuable tool that requires vigilance," advising users to adopt a more thoughtful, slower-paced approach when interacting with models like ChatGPT, Claude, and Gemini. This papal guidance encourages users to actively consider when, why, and what they ask AI systems, recognizing their limitations despite sophisticated responses. Delving into Pope Leo XIV's exhaustive treatise about humanity and AI, I was struck by a recurring theme: AI simulates fundamental human traits that it doesn't actually possess. For starters, AI lacks the grounding we humans get from our real-world experiences, Pope Leo noted in his first encyclical, which was released Monday by the Vatican. Yes, AI models like ChatGPT (or more specifically, GPT), Claude, and Gemini are trained on mountains of data that seemingly represent the entirety of human knowledge. But all that data is just that: data.
AI Agents Plunged the Tech World Into Chaos. Here's Exactly How That Happened
Here's Exactly How That Happened The definitive story of how Claude Code and OpenClaw kicked off computing's biggest transformation possibly ever. "Hi, my name is Peter, and I'm a Claudeholic." It was August 2025 and Peter Steinberger was addressing a meetup in London called Claude Code Anonymous. Steinberger and some fellow addicts had arranged the event to network with people like themselves--techies swept up by coding tools such as Anthropic's paradigm-busting Claude Code. "I dedicate pretty much all my waking time to this, yet it doesn't feel enough," he told the gathering in a cozy, brick-walled room. A few months later, Anthropic released a new version of Claude Code, and the ranks of Claudeholics exploded . Called Opus 4.5, it could handle more complicated programming tasks, retain much more in its memory, run for many hours on end, and manage a team of AI subagents. Anthropic has what it describes as a "notoriously difficult" take-home exam for prospective engineering hires; in a head-to-head comparison of those people and its models, Anthropic claimed that Opus 4.5 "scored higher than any human candidate ever," which "raises questions on how AI will change engineering as a profession."
Causality as the Statistical Conscience of Artificial Intelligence: From Pearl's Ladder to Trustworthy Machines
Modern Artificial Intelligence achieves remarkable predictive power by optimizing statistical risk functionals over vast corpora. Yet a gap separates this from genuine intelligence: the inability to distinguish correlation from causation. This paper argues that causal inference (identifying mechanisms invariant under intervention) is AI's indispensable statistical conscience. Without causal grounding, AI systems are correlation machines: powerful in familiar domains, brittle under distribution shift, and biased in high-stakes settings. Three contributions develop this argument. First, a Statistical Necessity Theorem for Causal Generalization: any algorithm achieving out-of-distribution generalization must encode causal structure, formalizing the distinction between prediction P(Y|X) and intelligence P(Y|do(X)). Second, a unified framework connects Pearl's do-calculus, the Potential Outcomes framework, Double Machine Learning, and Invariant Risk Minimization as a family of Causal Statistical Estimators, each identifying interventional distributions under different assumptions. Third, three AI failure modes (hallucination in large language models, reward hacking in reinforcement learning from human feedback, and degradation under distribution shift) are manifestations of causal blindness, each admitting a principled statistical remedy. Trustworthy AI is, at its core, a problem of causal statistics. The statistical community is not merely equipped to solve it -- it is the only community with the foundational tools to do so rigorously.
Assessing the Operational Viability of Foundation Models for Time Series Forecasting
Soni, Kavin, Das, Debanshu, Guduguntla, Vamshi
Time series forecasting drives operational decisions in areas like finance, transportation, and energy. While supervised learning approaches achieve strong performance, they require domain-specific training, feature engineering, and ongoing maintenance. Large-scale foundation models have recently emerged as a zero-shot alternative, avoiding task-specific training much like LLMs. In this work, we evaluate foundation models against standard supervised approaches. Rather than focusing solely on aggregate accuracy, we analyze performance across four operational regimes: periodic human-centric systems, physically constrained processes, stochastic financial markets, and heterogeneous demand forecasting. Our results characterize optimal deployment areas. Foundation models perform well in domains with transferable periodic structures and are efficient for cold-start or long-tail scenarios. Conversely, supervised specialists maintain higher precision in systems governed by strict physical constraints. In financial domains, newer foundation models are rapidly closing the performance gap with supervised specialists. We further quantify trade-offs in inference latency, data drift adaptability, and deployment constraints. Finally, we propose a Complexity Router that assigns each series to the optimal model class using empirical features. We demonstrate that this selective routing achieves higher accuracy and significantly lower inference costs compared to deploying a universal foundation model, providing a practical framework for balancing generalization and efficiency.
An Effective-Rank Audit of Alignment-Induced Activation Shifts: Confound Control, Constructive Calibration, and Limits
We audit alignment-induced shifts in residual-stream activations of three open-weight instruction-tuned LLMs (Llama-3.1-8B-Instruct, Gemma-2-9B-it, Qwen-2.5-7B-Instruct) using the effective rank of the alignment modification matrix on safety-relevant inputs, rho_eps := rank_eps(M_Ds)/d, which formalizes the single-refusal-direction observation of Arditi et al. (2024) as a continuous quantity. The paper has three contributions. (1) Confound-controlled measurement: a four-variant decomposition (M_naive, M_template, M_aligned, M_DiD) separates chat-template formatting, alignment-stage shift, and the refusal-mediating direction, and recovers the Arditi refusal direction on M_DiD at |cos| in {0.77, 0.86, 0.50} (Llama/Gemma/Qwen); chat-template-controlled rho_eps is {0.0029, 0.0048, 0.0044}, and the centered SVD residual is 4-7x larger. (2) Constructive calibration on a 3-layer MLP across rho_eps in {0.008, 0.17, 0.33, 0.40} exhibits a sweet-spot vs. brittle distinction: mild rank-maximization (lambda=5) buys ablation robustness, while strong regularization at the same nominal rho_eps (lambda=50) does not. rho_eps is a diagnostic for fragility, not a target whose mechanical inflation buys robustness. (3) Limits of rank-based diagnostics: (a) not safety-specific (LRH baseline is 2-3x the safety value); (b) SVD principal ordering does not match causal ordering (Llama u_2 inert despite ranking second; cumulative ablation non-monotone at k=5); (c) the spectral-gap hypothesis required to upgrade the O(rho_eps * d) achievability bound to a matching Mirsky-route lower bound fails empirically (1/90 Llama layer-reference pairs, 0/36 MLP combinations) and structurally (kappa_lb <= 2/(eps * r)). The matching lower bound remains an open problem.