Uncertainty
The General Theory of Localization Methods
This paper proposes a general machine learning framework called the localization method, which is fundamentally built on two core concepts: localization kernels and local means -- key components that underpin the self-attention mechanism. To establish a rigorous theoretical foundation, the framework is formally defined through two essential pillars: the formulation of the local(-ized) model and the localization trick. We systematically investigate the connections between the localization method and a wide range of existing machine learning models/methods, including (but not limited to) kernel methods, lazy learning, the MeanShift algorithm, relaxation labeling, Hopfield networks, local linear embedding (LLE), fuzzy inference, and denoising autoencoders (DAEs). By dissecting these relationships, we clarify the broader theoretical significance of the localization method and demonstrate its practical applicability across diverse machine learning tasks. Furthermore, we explore advanced extensions of the framework, such as adaptive kernels, hierarchical local models, and non-local models. Notably, we show that the Transformer -- a cornerstone of modern sequence modeling -- can be constructed using hierarchical local models, revealing the ability of the localization method to unify and generalize state-of-the-art architectures. This work not only provides a unified theoretical lens to reinterpret existing models but also offers new methodological tools for designing flexible, data-adaptive learning systems.
A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning
Nguyen, Thien V., Habrard, Amaury, Guedj, Benjamin
Physics-informed machine learning (PIML) integrates mechanistic knowledge, typically in the form of partial differential equations (PDE), into data-driven models. Despite strong empirical performance, its statistical generalisation properties remain poorly understood, particularly in the regression setting with unbounded losses. Existing analyses rely on approximation or stability arguments and do not fully capture how physical structure influences generalisation from finite data. In this work, we develop a PAC-Bayesian framework for PIML that provides high-probability generalisation guarantees in the presence of unbounded losses. We adopt a multi-task perspective that jointly treats data fidelity, PDE residuals, initial and boundary conditions, avoiding the looseness induced by standard union-bound approaches. Our analysis leverages the structure of physics-informed objectives to derive novel bounds where the complexity scales with input-gradient norms of the losses, revealing a direct link between physical regularity and generalisation. We instantiate this framework under Sobolev and Poincaré-type assumptions, yielding two classes of bounds that trade off statistical complexity and smoothness in different regimes. Building on these results, we propose a self-bounding-aware learning algorithm that directly optimises tractable surrogates of the derived bounds, along with a practical procedure to estimate the associated constants in realistic settings. Empirical evaluations on standard PDE benchmarks demonstrate that our bounds are non-vacuous, significantly tighter than union-bound baselines, and can be effectively minimised during training. Overall, our results provide a principled statistical foundation for the generalisation of physics-informed models.
Constrained Bayesian Experimental Design via Online Planning
Guo, Yujia, Huang, Daolang, Zhang, Xinyu, Katt, Sammie, Kaski, Samuel, Bharti, Ayush
Bayesian experimental design (BED) is a principled framework for data-efficient design of sequential experiments. However, existing BED methods are unable to adapt to dynamic constraints inherent in real-world tasks due to budget limitations, varying costs, or physical constraints that restrict how designs evolve over time. In this paper, we introduce a novel approach to BED that enables constrained optimization of experimental designs by combining offline pre-training of an amortized policy and a posterior network with online multi-step lookahead planning using scenario trees. We empirically demonstrate that our method yields substantially more informative design sequences than existing methods across a range of constrained BED tasks, while incurring only a modest additional computational overhead.
Sampling Data with Chains of Forward-Backward Diffusion Steps
Kang, Hyunmo, Levi, Noam Itzhak, Wegner, Corinna Elena, Korchinski, Daniel J., Wyart, Matthieu
Sampling from learned high-dimensional distributions is a foundational computational problem. We introduce U-turn chains: Markov chains obtained by iterating short forward-backward steps of a diffusion model, in which each step proposes a move that remains on the learned data manifold and, paired with a Metropolis-Hastings correction, samples from energy-modified targets. For synthetic languages, we show that minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold; ergodicity is restored at larger U-turn magnitude. In the non-ergodic regime, low-level features relax faster than high-level ones, an ordering that inverts only at sufficiently large U-turn magnitude. We test these predictions on natural language and natural images. In both modalities, minimal U-turns relax slowly, especially for high-level features approximated by deep representations in CNNs or LLMs. The layer-ordering inversion appears only at large noise when mixing is efficient -- signatures consistent with strongly constrained, weakly mixing local dynamics. We discuss the implications of these results for sampling with diffusion models.
Gaussian Process-based learning with new MCMC-based implementation of Wishart prior on correlation matrix
Warrior, Kane, Chakrabarty, Dalia
Gaussian Process (GP) models are widely used as probabilistic models for nonlinear functions because they combine flexible function modelling with uncertainty quantification (Rasmussen and Williams, 2006; Williams, 1998; MacKay, 1992; Neal, 1995). Their predictive performance depends heavily on how kernel hyperparameters are learnt (Sundararajan and Keerthi, 2001). This becomes especially important in higher-dimensional multivariate settings, where many input-specific hyperparameters may be present and where only some inputs may contribute meaningful predictive structure (MacKay, 1992; Neal, 1995; Rasmussen and Williams, 2006; Linkletter et al., 2006; Paananen et al., 2019). In standard Bayesian formulations of GP learning, prior specification is usually imposed directly on kernel hyperparameters such as lengthscales, amplitude parameters, and noise terms (Rasmussen and Williams, 2006; Williams, 1998). This is natural from a modelling point of view, but it does not always give useful control over the covariance structure that those hyperparameters induce over the observed design points (Barnard et al., 2000; Gelman, 2006; Daniels and Kass, 1999; Huang and Wand, 2013). However, it is this induced covariance matrix that directly governs likelihood evaluation, numerical stability, and predictive behaviour (Rasmussen and Williams, 2006; Stein, 1999). 1
From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models
Liang, Yuchen, Shroff, Ness, Liang, Yingbin
Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, but, especially for uniform-rate models, they often require many steps to generate a single sample. Existing acceleration methods either rely on training additional quantities or suffer from slow mixing. In this work, we propose a novel Gibbs-based corrector for discrete diffusion models, termed Gibbs-Accelerated Discrete Diffusion (GADD). GADD leverages the structure of the concrete score function to construct Gibbs posterior likelihoods directly, without requiring any additional training beyond standard score estimation. We show that GADD achieves an overall sampling complexity of $\mathcal{O}(\mathrm{polylog} (\varepsilon^{-1}))$, yielding the first such rate for diffusion-based samplers for uniform-rate discrete diffusion models. We also conduct numerical experiments demonstrating the practical advantages of GADD across synthetic data, zero-shot text sampling, and zero-shot conditional music generation. These results corroborate the theory and show that GADD consistently improves sample quality and wall-clock efficiency over standard baselines, including vanilla Euler methods and CTMC correctors. Beyond this, our theoretical analysis introduces a novel framework for analyzing predictor-corrector methods in discrete diffusion models, which may be of independent interest. Unlike existing approaches that rely on the Girsanov change-of-measure technique, our method is based on an induction argument that tracks error propagation across predictor iterations while accounting for inaccuracies in the corrector updates.
Sub-Gaussian Concentration and Entropic Normality of the Maximum Likelihood Estimator
Barnes, Leighton P., Dytso, Alex
It is well known that, under standard regularity conditions, the maximum likelihood estimator (MLE) satisfies a central limit theorem and converges in distribution to a Gaussian random variable as the sample size grows. This paper strengthens this classical result by developing several stronger forms of asymptotic normality for the normalized MLE. With additional assumptions on the score, we first establish sub-Gaussian tail bounds and convergence of all moments for the normalized estimation error. We then prove an entropic central limit theorem for a smoothed version of the estimator, showing convergence in relative entropy to the limiting Gaussian law. When the Fisher information of the normalized estimate is bounded, or its density has bounded first derivative, we further show that the smoothing can be removed, yielding entropic normality of the MLE itself. The proofs develop auxiliary tools that may be of independent interest, including exponential consistency bounds, high-moment estimates, and entropy-control arguments for the estimator.
Variance-Reduced Manifold Sampling via Polynomial-Maximization Density Estimation
Uniform sampling on implicitly defined manifolds is a core primitive in motion planning, constrained simulation, and probabilistic machine learning. MASEM addresses this problem by entropy-maximizing resampling, but its resampling weights depend on a local k-nearest-neighbour density estimate whose errors can be amplified by aggressive resampling temperatures. We ask whether a polynomial-maximization moment estimator can replace the plug-in density rule without changing the surrounding MASEM architecture. The proposed PMM-MASEM module computes shell spacings from nested k-nearest-neighbour radii, estimates their standardized cumulants, and uses a gated PMM2/PMM3 estimator only when the spacing distribution departs from the flat Exp(1) regime; otherwise it falls back to the plug-in/MLE rule. This fallback is essential: on a flat homogeneous manifold the plug-in estimator is already the MLE, so PMM should not outperform it. A local Known-DGP Monte Carlo experiment confirms this gate: the selector returns MLE on flat Exp(1) spacings and reduces density MSE by 22--36% on asymmetric gamma and boundary-spacing regimes. The evidence is not uniformly positive: PMM3 worsens a platykurtic uniform spacing law, and a lightweight resampling-proxy experiment improves seven-lobes coverage but degrades the sine and swiss-roll proxies. The current evidence therefore supports an applicability-boundary result rather than a general MASEM improvement claim.
Optimal Non-Asymptotic Edgeworth Expansions for Multivariate Neural Network Outputs
Finite-width fully connected neural networks with Gaussian-initialized weights deviate from their infinite-width Gaussian limit, exhibiting non-vanishing higher-order cumulants. We approximate these deviations, for a neural network evaluated in a finite number of inputs, using multidimensional Edgeworth expansions of arbitrary order $4m-1$, with $m\in\mathbb{N}$. Assuming that the corresponding Gaussian limit has an invertible covariance matrix and that the activation function is polynomially bounded, we establish a bound of order $n^{-m}$ on the total variation distance between the law of the true network output and its Edgeworth approximation, with matching lower bounds. As an application, we quantify the error in Bayesian posterior distributions when the prior is replaced by its Edgeworth expansion. Our results are more general and also apply to sequences of conditionally Gaussian vectors converging to a Gaussian vector with invertible covariance.
Debiasing Random Oblique Projections for Subsampled OLS and Fast CUR in High Dimensions
Niu, Chengmei, Garg, Sachin, Dereziński, Michał, Liao, Zhenyu
Random sampling is a fundamental tool in modern machine learning and numerical linear algebra for reducing the computational cost of large-scale matrix problems. Existing analyses, however, rely primarily on subspace embedding guarantees, which do not precisely characterize the statistical bias of nonlinear random oblique projections induced by sampling, which arises ubiquitously in subsampled least squares and fast low-rank approximation methods. Because (pseudo)inversion is nonlinear, these random oblique projections can be systematically biased even when the underlying sketch is unbiased, thereby introducing hidden bias into downstream least squares and low-rank approximation solutions. In this work, we develop a unified non-asymptotic theory for random oblique projections in high dimensions. We show that standard random sampling schemes generally induce a systematic statistical bias overlooked by classical subspace embedding-style analyses, and we propose a principled debiasing framework to correct it. We illustrate the power of the theory through two canonical applications. For subsampled least squares, we obtain sharp bias--variance characterizations, reveal previously unrecognized statistical suboptimality in widely used sampling schemes, and identify when debiasing yields provable improvements. For fast CUR decomposition, we develop a debiased approach with improved approximation accuracy. Numerical experiments further validate our theoretical findings.