Goto

Collaborating Authors

 Bayesian Learning


A Bayesian Updating Framework for Long-term Multi-Environment Trial Data in Plant Breeding

arXiv.org Machine Learning

In variety testing, multi-environment trials (MET) are essential for evaluating the genotypic performance of crop plants. A persistent challenge in the statistical analysis of MET data is the estimation of variance components, which are often still inaccurately estimated or shrunk to exactly zero when using residual (restricted) maximum likelihood (REML) approaches. At the same time, institutions conducting MET typically possess extensive historical data that can, in principle, be leveraged to improve variance component estimation. However, these data are rarely incorporated sufficiently. The purpose of this paper is to address this gap by proposing a Bayesian framework that systematically integrates historical information to stabilize variance component estimation and better quantify uncertainty. Our Bayesian linear mixed model (BLMM) reformulation uses priors and Markov chain Monte Carlo (MCMC) methods to maintain the variance components as positive, yielding more realistic distributional estimates. Furthermore, our model incorporates historical prior information by managing MET data in successive historical data windows. Variance component prior and posterior distributions are shown to be conjugate and belong to the inverse gamma and inverse Wishart families. While Bayesian methodology is increasingly being used for analyzing MET data, to the best of our knowledge, this study comprises one of the first serious attempts to objectively inform priors in the context of MET data. This refers to the proposed Bayesian updating approach. To demonstrate the framework, we consider an application where posterior variance component samples are plugged into an A-optimality experimental design criterion to determine the average optimal allocations of trials to agro-ecological zones in a sub-divided target population of environments (TPE).


Early-stopped aggregation: Adaptive inference with computational efficiency

arXiv.org Machine Learning

When considering a model selection or, more generally, an aggregation approach for adaptive statistical inference, it is often necessary to compute estimators over a wide range of model complexities including unnecessarily large models even when the true data-generating process is relatively simple, due to the lack of prior knowledge. This requirement can lead to substantial computational inefficiency. In this work, we propose a novel framework for efficient model aggregation called the early-stopped aggregation (ESA): instead of computing and aggregating estimators for all candidate models, we compute only a small number of simpler ones using an early-stopping criterion and aggregate only these for final inference. Our framework is versatile and applies to both Bayesian model selection, in particular, within the variational Bayes framework, and frequentist estimation, including a general penalized estimation setting. We investigate adaptive optimal property of the ESA approach across three learning paradigms. We first show that ESA achieves optimal adaptive contraction rates in the variational Bayes setting under mild conditions. We extend this result to variational empirical Bayes, where prior hyperparameters are chosen in a data-dependent manner. In addition, we apply the ESA approach to frequentist aggregation including both penalization-based and sample-splitting implementations, and establish corresponding theory. As we demonstrate, there is a clear unification between early-stopped Bayes and frequentist penalized aggregation, with a common "energy" functional comprising a data-fitting term and a complexity-control term that drives both procedures. We further present several applications and numerical studies that highlight the efficiency and strong performance of the proposed approach.


Doubly Outlier-Robust Online Infinite Hidden Markov Model

arXiv.org Machine Learning

We derive a robust update rule for the online infinite hidden Markov model (iHMM) for when the streaming data contains outliers and the model is misspecified. Leveraging recent advances in generalised Bayesian inference, we define robustness via the posterior influence function (PIF), and provide conditions under which the online iHMM has bounded PIF. Imposing robustness inevitably induces an adaptation lag for regime switching. Our method, which is called Batched Robust iHMM (BR-iHMM), balances adaptivity and robustness with two additional tunable parameters. Across limit order book data, hourly electricity demand, and a synthetic high-dimensional linear system, BR-iHMM reduces one-step-ahead forecasting error by up to 67% relative to competing online Bayesian methods. Together with theoretical guarantees of bounded PIF, our results highlight the practicality of our approach for both forecasting and interpretable online learning.


Performance of weakly-supervised electronic health record-based phenotyping methods in rare-outcome settings

arXiv.org Machine Learning

Accurately identifying patients with specific medical conditions is a key challenge when using clinical data from electronic health records. Our objective was to comprehensively assess when weakly-supervised prediction methods, which use silver-standard labels (proxy measures of the true outcome) rather than gold-standard true labels, perform well in rare-outcome settings like vaccine safety studies. We compared three methods (PheNorm, MAP, and sureLDA) that combine structured features and features derived from clinical text using natural language processing, through an extensive simulation study with data-generating mechanisms ranging from simple to complex, varying outcome rates, and varying degrees of informative silver labels. We also considered using predicted probabilities to design a chart review validation study. No single method dominated the other across all prediction performance metrics. Probability-guided sampling selected a cohort enriched for patients with more mentions of important concepts in chart notes. SureLDA, the most complex of the three algorithms we considered, often performed well in simulations. Performance depended greatly on selected tuning parameters. Care should be taken when using weakly-supervised prediction methods in rare-outcome settings, particularly if the probabilities will be used in downstream analysis, but these methods can work well when silver labels are strong predictors of true outcomes.


Inferring Change Points in Regression via Sample Weighting

arXiv.org Machine Learning

We study the problem of identifying change points in high-dimensional generalized linear models, and propose an approach based on sample-weighted empirical risk minimization. Our method, Weighted ERM, encodes priors on the change points via weights assigned to each sample, to obtain weighted versions of standard estimators such as M-estimators and maximum-likelihood estimators. Under mild assumptions on the data, we obtain a precise asymptotic characterization of the performance of our method for general Gaussian designs, in the high-dimensional limit where the number of samples and covariate dimension grow proportionally. We show how this characterization can be used to efficiently construct a posterior distribution over change points. Numerical experiments on both simulated and real data illustrate the efficacy of Weighted ERM compared to existing approaches, demonstrating that sample weights constructed with weakly informative priors can yield accurate change point estimators. Our method is implemented as an open-source package, weightederm, available in Python and R.


A Deep Generative Approach to Stratified Learning

arXiv.org Machine Learning

While the manifold hypothesis is widely adopted in modern machine learning, complex data is often better modeled as stratified spaces -- unions of manifolds (strata) of varying dimensions. Stratified learning is challenging due to varying dimensionality, intersection singularities, and lack of efficient models in learning the underlying distributions. We provide a deep generative approach to stratified learning by developing two generative frameworks for learning distributions on stratified spaces. The first is a sieve maximum likelihood approach realized via a dimension-aware mixture of variational autoencoders. The second is a diffusion-based framework that explores the score field structure of a mixture. We establish the convergence rates for learning both the ambient and intrinsic distributions, which are shown to be dependent on the intrinsic dimensions and smoothness of the underlying strata. Utilizing the geometry of the score field, we also establish consistency for estimating the intrinsic dimension of each stratum and propose an algorithm that consistently estimates both the number of strata and their dimensions. Theoretical results for both frameworks provide fundamental insights into the interplay of the underlying geometry, the ambient noise level, and deep generative models. Extensive simulations and real dataset applications, such as molecular dynamics, demonstrate the effectiveness of our methods.


Uncertainty-Aware Sparse Identification of Dynamical Systems via Bayesian Model Averaging

arXiv.org Machine Learning

In many problems of data-driven modeling for dynamical systems, the governing equations are not known a priori and must be selected phenomenologically from a large set of candidate interactions and basis functions. In such situations, point estimates alone can be misleading, because multiple model components may explain the observed data comparably well, especially when the data are limited or the dynamics exhibit poor identifiability. Quantifying the uncertainty associated with model selection is therefore essential for constructing reliable dynamical models from data. In this work, we develop a Bayesian sparse identification framework for dynamical systems with coupled components, aimed at inferring both interaction structure and functional form together with principled uncertainty quantification. The proposed method combines sparse modeling with Bayesian model averaging, yielding posterior inclusion probabilities that quantify the credibility of each candidate interaction and basis component. Through numerical experiments on oscillator networks, we show that the framework accurately recovers sparse interaction structures with quantified uncertainty, including higher-order harmonic components, phase-lag effects, and multi-body interactions. We also demonstrate that, even in a phenomenological setting where the true governing equations are not contained in the assumed model class, the method can identify effective functional components with quantified uncertainty. These results highlight the importance of Bayesian uncertainty quantification in data-driven discovery of dynamical models.


A Predictive View on Streaming Hidden Markov Models

arXiv.org Machine Learning

We develop a predictive-first optimisation framework for streaming hidden Markov models. Unlike classical approaches that prioritise full posterior recovery under a fully specified generative model, we assume access to regime-specific predictive models whose parameters are learned online while maintaining a fixed transition prior over regimes. Our objective is to sequentially identify latent regimes while maintaining accurate step-ahead predictive distributions. Because the number of possible regime paths grows exponentially, exact filtering is infeasible. We therefore formulate streaming inference as a constrained projection problem in predictive-distribution space: under a fixed hypothesis budget, we approximate the full posterior predictive by the forward-KL optimal mixture supported on $S$ paths. The solution is the renormalised top-$S$ posterior-weighted mixture, providing a principled derivation of beam search for HMMs. The resulting algorithm is fully recursive and deterministic, performing beam-style truncation with closed-form predictive updates and requiring neither EM nor sampling. Empirical comparisons against Online EM and Sequential Monte Carlo under matched computational budgets demonstrate competitive prequential performance.


Data-Efficient Non-Gaussian Semi-Nonparametric Density Estimation for Nonlinear Dynamical Systems

arXiv.org Machine Learning

Accurate representation of non-Gaussian distributions of quantities of interest in nonlinear dynamical systems is critical for estimation, control, and decision-making, but can be challenging when forward propagations are expensive to carry out. This paper presents an approach for estimating probability density functions of states evolving under nonlinear dynamics using Seminonparametric (SNP), or Gallant-Nychka, densities. SNP densities employ a probabilists' Hermite polynomial basis to model non-Gaussian behavior and are positive everywhere on the support by construction. We use Monte Carlo to approximate the expectation integrals that arise in the maximum likelihood estimation of SNP coefficients, and introduce a convex relaxation to generate effective initial estimates. The method is demonstrated on density and quantile estimation for the chaotic Lorenz system. The results demonstrate that the proposed method can accurately capture non-Gaussian density structure and compute quantiles using significantly fewer samples than raw Monte Carlo sampling.


tBayes-MICE: A Bayesian Approach to Multiple Imputation for Time Series Data

arXiv.org Machine Learning

Time-series analysis is often affected by missing data, a common problem across several fields, including healthcare and environmental monitoring. Multiple Imputation by Chained Equations (MICE) has been prominent for imputing missing values through "fully conditional specification". We extend MICE using the Bayesian framework (tBayes-MICE), utilising Bayesian inference to impute missing values via Markov Chain Monte Carlo (MCMC) sampling to account for uncertainty in MICE model parameters and imputed values. We also include temporally informed initialisation and time-lagged features in the model to respect the sequential nature of time-series data. We evaluate the tBayes-MICE method using two real-world datasets (AirQuality and PhysioNet), and using both the Random Walk Metropolis (RWM) and the Metropolis-Adjusted Langevin Algorithm (MALA) samplers. Our results demonstrate that tBayes-MICE reduces imputation errors relative to the baseline methods over all variables and accounts for uncertainty in the imputation process, thereby providing a more accurate measure of imputation error. We also found that MALA mixed better than RWM across most variables, achieving comparable accuracy while providing more consistent posterior exploration. Overall, these findings suggest that the tBayes-MICE framework represents a practical and efficient approach to time-series imputation, balancing increased accuracy with meaningful quantification of uncertainty in various environmental and clinical settings.