Bayesian Inference
Information-Theoretic Perspectives on Optimizers
The interplay between optimizers and architectures in neural networks is complicated, and it is hard to understand why some optimizers work better on specific architectures. In this paper, we find that the traditionally used sharpness metric does not fully explain this interplay, and we introduce an information-theoretic metric, the entropy gap, to aid the analysis. We find that both sharpness and the entropy gap affect performance, including optimization dynamics and generalization. We further use information-theoretic tools to understand a recently proposed optimizer called Lion and find ways to improve it.
An interpretation of the Brownian bridge as a physics-informed prior for the Poisson equation
Alberts, Alex, Bilionis, Ilias
Physics-informed machine learning is one of the most commonly used methods for fusing physical knowledge in the form of partial differential equations with experimental data. The idea is to construct a loss function where the physical laws take the place of a regularizer and minimize it to reconstruct the underlying physical fields and any missing parameters. However, there is a noticeable lack of a direct connection between physics-informed loss functions and an overarching Bayesian framework. In this work, we demonstrate that Brownian bridge Gaussian processes can be viewed as a softly-enforced physics-constrained prior for the Poisson equation. We first show equivalence between the variational form of the physics-informed loss function for the Poisson equation and a kernel ridge regression objective. Then, through the connection between Gaussian process regression and kernel methods, we identify a Gaussian process for which the posterior mean function and physics-informed loss function minimizer agree. This connection allows us to probe different theoretical questions, such as convergence and behavior of inverse problems. We also connect the method to the important problem of identifying model-form error in applications.
Clustering Context in Off-Policy Evaluation
Guzman-Olivares, Daniel, Schmidt, Philipp, Golebiowski, Jacek, Bekasov, Artur
Off-policy evaluation can leverage logged data to estimate the effectiveness of new policies in e-commerce, search engines, media streaming services, or automatic diagnostic tools in healthcare. However, the performance of baseline off-policy estimators like IPS deteriorates when the logging policy significantly differs from the evaluation policy. Recent work proposes sharing information across similar actions to mitigate this problem. In this work, we propose an alternative estimator that shares information across similar contexts using clustering. We study the theoretical properties of the proposed estimator, characterizing its bias and variance under different conditions. We also compare the performance of the proposed estimator and existing approaches in various synthetic problems, as well as a real-world recommendation dataset. Our experimental results confirm that clustering contexts improves estimation accuracy, especially in deficient information settings.
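As a point of reference for the IPS baseline mentioned above, the following is a minimal sketch of the standard inverse propensity scoring estimate, which reweights each logged reward by the ratio of evaluation-policy to logging-policy action probabilities. It is not the clustering estimator proposed in the paper; the function name `ips_estimate` and the toy numbers are illustrative assumptions.

```python
import numpy as np


def ips_estimate(rewards, logging_probs, target_probs):
    """Standard IPS estimate of the evaluation policy's value.

    rewards[i]       : reward observed for the i-th logged action
    logging_probs[i] : probability the logging policy gave that action
    target_probs[i]  : probability the evaluation policy gives that action
    (Hypothetical helper for illustration only.)
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, dtype=float) / np.asarray(logging_probs, dtype=float)
    return float(np.mean(weights * rewards))


# Toy logged data (made-up numbers, for illustration only).
rewards = [1.0, 0.0, 1.0, 1.0]
logging_probs = [0.5, 0.5, 0.25, 0.25]
target_probs = [0.25, 0.25, 0.5, 0.5]

value = ips_estimate(rewards, logging_probs, target_probs)
print(value)
```

The importance weights grow large when the logging policy rarely takes actions that the evaluation policy favors, which is one way to see why the estimator's variance deteriorates when the two policies differ significantly.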
Post-Hoc Uncertainty Quantification in Pre-Trained Neural Networks via Activation-Level Gaussian Processes
Bergna, Richard, Depeweg, Stefan, Ordonez, Sergio Calvo, Plenk, Jonathan, Cartea, Alvaro, Hernandez-Lobato, Jose Miguel
Uncertainty quantification in neural networks through methods such as Dropout, Bayesian neural networks and Laplace approximations is either prone to underfitting or computationally demanding, rendering these approaches impractical for large-scale datasets. In this work, we address these shortcomings by shifting the focus from uncertainty in the weight space to uncertainty at the activation level, via Gaussian processes. More specifically, we introduce the Gaussian Process Activation function (GAPA) to capture neuron-level uncertainties. Our approach operates in a post-hoc manner, preserving the original mean predictions of the pre-trained neural network and thereby avoiding the underfitting issues commonly encountered in previous methods. We propose two methods. The first, GAPA-Free, employs empirical kernel learning from the training data for the hyperparameters and is highly efficient during training. The second, GAPA-Variational, learns the hyperparameters via gradient descent on the kernels, thus affording greater flexibility. Empirical results demonstrate that GAPA-Variational outperforms the Laplace approximation on most datasets in at least one of the uncertainty quantification metrics.
Forecasting intermittent time series with Gaussian Processes and Tweedie likelihood
Damato, Stefano, Azzimonti, Dario, Corani, Giorgio
We introduce the use of Gaussian Processes (GPs) for the probabilistic forecasting of intermittent time series. The model is trained in a Bayesian framework that accounts for the uncertainty about the latent function and marginalizes it out when making predictions. We couple the latent GP variable with two types of forecast distributions: the negative binomial (NegBinGP) and the Tweedie distribution (TweedieGP). While the negative binomial has already been used in forecasting intermittent time series, this is the first time in which a fully parameterized Tweedie density is used for intermittent time series. We properly evaluate the Tweedie density, which is both zero-inflated and heavy tailed, avoiding simplifying assumptions made in existing models. We test our models on thousands of intermittent count time series. Results show that our models provide consistently better probabilistic forecasts than the competitors. In particular, TweedieGP obtains the best estimates of the highest quantiles, thus showing that it is more flexible than NegBinGP.
Constrained Generative Modeling with Manually Bridged Diffusion Models
Naderiparizi, Saeid, Liang, Xiaoxuan, Zwartsenberg, Berend, Wood, Frank
In this paper we describe a novel framework for diffusion-based generative modeling on constrained spaces. In particular, we introduce manual bridges, a framework that expands the kinds of constraints that can be practically used to form so-called diffusion bridges. We develop a mechanism for combining multiple such constraints so that the resulting multiply-constrained model remains a manual bridge that respects all constraints. We also develop a mechanism for training a diffusion model that respects such multiple constraints while also adapting it to match a data distribution. We develop and extend theory demonstrating the mathematical validity of our mechanisms. Additionally, we demonstrate our mechanism in constrained generative modeling tasks, highlighting a particular high-value application in modeling trajectory initializations for path planning and control in autonomous vehicles.
A Fokker-Planck-Based Loss Function that Bridges Dynamics with Density Estimation
Lu, Zhixin, Kuśmierz, Łukasz, Mihalas, Stefan
We have derived a novel loss function from the Fokker-Planck equation that links dynamical system models with their probability density functions, demonstrating its utility in model identification and density estimation. In the first application, we show that this loss function can enable the extraction of dynamical parameters from non-temporal datasets, including timestamp-free measurements from steady non-equilibrium systems such as noisy Lorenz systems and gene regulatory networks. In the second application, when coupled with a density estimator, this loss facilitates density estimation when the dynamic equations are known. For density estimation, we propose a density estimator that integrates a Gaussian Mixture Model with a normalizing flow model. It simultaneously estimates normalized density, energy, and score functions from both empirical data and dynamics. It is compatible with a variety of data-based training methodologies, including maximum likelihood and score matching. It features a latent space akin to a modern Hopfield network, where the inherent Hopfield energy effectively assigns low densities to sparsely populated data regions, addressing common challenges in neural density estimators. Additionally, this Hopfield-like energy enables direct and rapid data manipulation through the Concave-Convex Procedure (CCCP) rule, facilitating tasks such as denoising and clustering. Our work demonstrates a principled framework for leveraging the complex interdependencies between dynamics and density estimation, as illustrated through synthetic examples that clarify the underlying theoretical intuitions.
Stein's unbiased risk estimate and Hyvärinen's score matching
Ghosh, Sulagna, Ignatiadis, Nikolaos, Koehler, Frederic, Lee, Amber
We study two G-modeling strategies for estimating the signal distribution (the empirical Bayesian's prior) from observations corrupted with normal noise. First, we choose the signal distribution by minimizing Stein's unbiased risk estimate (SURE) of the implied Eddington/Tweedie Bayes denoiser, an approach motivated by optimal empirical Bayesian shrinkage estimation of the signals. Second, we select the signal distribution by minimizing Hyvärinen's score matching objective for the implied score (derivative of log-marginal density), targeting minimal Fisher divergence between estimated and true marginal densities. While these strategies appear distinct, they are known to be mathematically equivalent. We provide a unified analysis of SURE and score matching under both well-specified signal distribution classes and misspecification. In the classical well-specified setting with homoscedastic noise and compactly supported signal distribution, we establish nearly parametric rates of convergence of the empirical Bayes regret and the Fisher divergence. In a commonly studied misspecified model, we establish fast rates of convergence to the oracle denoiser and corresponding oracle inequalities. Our empirical results demonstrate competitiveness with nonparametric maximum likelihood in well-specified settings, while showing superior performance under misspecification, particularly in settings involving heteroscedasticity and side information.
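The equivalence stated above can be sketched in the scalar, homoscedastic case. The notation here is an assumption for illustration and is not taken from the paper: $Z = \mu + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$, $f_G$ is the marginal density of $Z$ implied by a candidate signal distribution $G$, and $s_G$ is its score.

```latex
% Implied score and Eddington/Tweedie denoiser
s_G(z) = \frac{d}{dz} \log f_G(z), \qquad
\delta_G(z) = z + \sigma^2 s_G(z).

% Stein's unbiased risk estimate of \delta_G at an observation z
\mathrm{SURE}(G; z) = \sigma^2 + \sigma^4 s_G(z)^2 + 2 \sigma^4 s_G'(z).

% Hyvarinen's score matching objective at the same observation
J(G; z) = \tfrac{1}{2} s_G(z)^2 + s_G'(z).

% Hence the two criteria differ by an affine map that does not depend on G
\mathrm{SURE}(G; z) = \sigma^2 + 2 \sigma^4 \, J(G; z).
```

Under these assumptions, averaging over the observations and minimizing over $G$ selects the same signal distribution under either criterion, since $\sigma^2$ is fixed and $2\sigma^4 > 0$.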
Bayesian Computation in Deep Learning
Chen, Wenlong, Li, Bolian, Zhang, Ruqi, Li, Yingzhen
Bayesian computation has achieved profound success in many modeling tasks with statistical tools such as generalized linear models (Dobson and Barnett, 2018; Nelder and Wedderburn, 1972). Yet these traditional tools fail to produce satisfactory predictions for high-dimensional and highly complex data such as images, speech and videos. Deep Learning (LeCun et al., 2015a) provides an attractive solution. As of late 2023, deep neural networks achieve accurate predictions for image classification (Dehghani et al., 2023), segmentation (Kirillov et al., 2023) and speech recognition tasks (Zhang et al., 2023). Meanwhile, they have also demonstrated an astonishing capability for generating photo-realistic and/or artistic images (Rombach et al., 2022), music (Agostinelli et al., 2023) and videos (Liang et al., 2022). Deep neural networks have become a standard modeling tool for many applications in AI and related fields, and the success of deep learning so far is based on training deterministic deep neural networks on big data. So one might ask: is there a place for Bayesian computation in modern deep learning?
Sparklen: A Statistical Learning Toolkit for High-Dimensional Hawkes Processes in Python
This paper introduces the Python package Sparklen (see Lacoste (2025)), which implements a complete set of statistical learning methods for exponential Hawkes processes, with an emphasis on the high-dimensional setting. Hawkes processes, introduced in Hawkes (1971), form a specific but rather versatile class of point processes. Such processes model time series in which the occurrence of one event temporarily increases the probability of other events occurring. This intrinsic ability to take into account self-exciting effects makes them particularly interesting for real data modeling. Historically applied in seismology (see Ogata (1988)), they have since been used in a wide variety of other fields, including neuroscience in Reynaud-Bouret, Rivoirard, and Tuleau-Malot (2013), finance in Bacry, Mastromatteo, and Muzy (2015), and ecology in Denis, Dion-Blanc, Lacoste, Sansonnet, and Bas (2024). The multidimensional version, known as the Multivariate Hawkes Process (MHP), additionally captures interactions among the univariate processes within a network. This generalization enables the modeling of more intricate dynamics, significantly expanding the range of potential applications. For example, MHP has been applied to model action potentials within neural networks in Bonnet, Dion-Blanc, Gindraud, and Lemler (2022), and for trend detection in social networks in Pinto, Chahed, and Altman (2015).
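The self-exciting behavior described above can be illustrated with a minimal sketch of the conditional intensity of a univariate Hawkes process with an exponential kernel. This is not the Sparklen API; the function name `hawkes_intensity` and the parametrization (baseline `mu`, jump scale `alpha`, decay rate `beta`) are assumptions chosen for illustration, and other equivalent parametrizations exist.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity of a univariate exponential Hawkes process.

    Assumed parametrization (hypothetical, for illustration):
        lambda(t) = mu + sum over past events s < t of
                    alpha * beta * exp(-beta * (t - s))
    """
    excitation = sum(
        alpha * beta * math.exp(-beta * (t - s))
        for s in event_times
        if s < t  # only events strictly before t contribute
    )
    return mu + excitation


events = [1.0, 1.5]  # made-up event times
before = hawkes_intensity(0.5, events, mu=0.5, alpha=0.8, beta=2.0)
just_after = hawkes_intensity(1.6, events, mu=0.5, alpha=0.8, beta=2.0)
later = hawkes_intensity(5.0, events, mu=0.5, alpha=0.8, beta=2.0)
print(before, just_after, later)
```

Before any event the intensity equals the baseline `mu`; each event raises it temporarily, and the excitation then decays back toward the baseline at rate `beta`.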