Learning Graphical Models
Wasserstein-p Central Limit Theorem Rates: From Local Dependence to Markov Chains
Finite-time central limit theorem (CLT) rates play a central role in modern machine learning. In this paper, we study CLT rates for multivariate dependent data in Wasserstein-$p$ ($W_p$) distance, for general $p \geq 1$. We focus on two fundamental dependence structures that commonly arise in machine learning: locally dependent sequences and geometrically ergodic Markov chains. In both settings, we establish the first optimal $O(n^{-1/2})$ rate in $W_1$, as well as the first $W_p$ ($p\ge 2$) CLT rates under mild moment assumptions, substantially improving the best previously known bounds in these dependent-data regimes. As an application of our optimal $W_1$ rate for locally dependent sequences, we further obtain the first optimal $W_1$-CLT rate for multivariate $U$-statistics. On the technical side, we derive a tractable auxiliary bound for $W_1$ Gaussian approximation errors that is well suited for studying dependent data. For Markov chains, we further prove that the regeneration time of the split chain associated with a geometrically ergodic chain has a geometric tail without assuming strong aperiodicity or other restrictive conditions. These tools may be of independent interests and enable our optimal $W_1$ rates and underpin our $W_p$ ($p\ge 2$) results.
Riesz Representer Fitting under Bregman Divergence: A Unified Framework for Debiased Machine Learning
Estimating the Riesz representer is central to debiased machine learning for causal and structural parameter estimation. We propose generalized Riesz regression, a unified framework that estimates the Riesz representer by fitting a representer model via Bregman divergence minimization. This framework includes the squared loss and the Kullback--Leibler (KL) divergence as special cases: the former recovers Riesz regression, while the latter recovers tailored loss minimization. Under suitable model specifications, the dual problems correspond to covariate balancing, which we call automatic covariate balancing. Moreover, under the same specifications, outcome averages weighted by the estimated Riesz representer satisfy Neyman orthogonality even without estimating the regression function, a property we call automatic Neyman orthogonalization. This property not only reduces the estimation error of Neyman orthogonal scores but also clarifies a key distinction between debiased machine learning and targeted maximum likelihood estimation. Our framework can also be viewed as a generalization of density ratio fitting under Bregman divergences to Riesz representer estimation, and it applies beyond density ratio estimation. We provide convergence analyses for both reproducing kernel Hilbert space (RKHS) and neural network model classes. A Python package for generalized Riesz regression is available at https://github.com/MasaKat0/grr.
Horseshoe Mixtures-of-Experts (HS-MoE)
Horseshoe mixtures-of-experts (HS-MoE) models provide a Bayesian framework for sparse expert selection in mixture-of-experts architectures. We combine the horseshoe prior's adaptive global-local shrinkage with input-dependent gating, yielding data-adaptive sparsity in expert usage. Our primary methodological contribution is a particle learning algorithm for sequential inference, in which the filter is propagated forward in time while tracking only sufficient statistics. We also discuss how HS-MoE relates to modern mixture-of-experts layers in large language models, which are deployed under extreme sparsity constraints (e.g., activating a small number of experts per token out of a large pool).
MLCBART: Multilabel Classification with Bayesian Additive Regression Trees
Tian, Jiahao, Chipman, Hugh, Loughin, Thomas
Multilabel Classification (MLC) deals with the simultaneous classification of multiple binary labels. The task is challenging because, not only may there be arbitrarily different and complex relationships between predictor variables and each label, but associations among labels may exist even after accounting for effects of predictor variables. In this paper, we present a Bayesian additive regression tree (BART) framework to model the problem. BART is a nonparametric and flexible model structure capable of uncovering complex relationships within the data. Our adaptation, MLCBART, assumes that labels arise from thresholding an underlying numeric scale, where a multivariate normal model allows explicit estimation of the correlation structure among labels. This enables the discovery of complicated relationships in various forms and improves MLC predictive performance. Our Bayesian framework not only enables uncertainty quantification for each predicted label, but our MCMC draws produce an estimated conditional probability distribution of label combinations for any predictor values. Simulation experiments demonstrate the effectiveness of the proposed model by comparing its performance with a set of models, including the oracle model with the correct functional form. Results show that our model predicts vectors of labels more accurately than other contenders and its performance is close to the oracle model. An example highlights how the method's ability to produce measures of uncertainty on predictions provides nuanced understanding of classification results.
A Statistical Assessment of Amortized Inference Under Signal-to-Noise Variation and Distribution Shift
Shreshtth, Roy Shivam Ram, Hazra, Arnab, Mukherjee, Gourab
Since the turn of the century, approximate Bayesian inference has steadily evolved as new computational techniques have been incorporated to handle increasingly complex and large-scale predictive problems. The recent success of deep neural networks and foundation models has now given rise to a new paradigm in statistical modeling, in which Bayesian inference can be amortized through large-scale learned predictors. In amortized inference, substantial computation is invested upfront to train a neural network that can subsequently produce approximate posterior or predictions at negligible marginal cost across a wide range of tasks. At deployment, amortized inference offers substantial computational savings compared with traditional Bayesian procedures, which generally require repeated likelihood evaluations or Monte Carlo simulations for predictions for each new dataset. Despite the growing popularity of amortized inference, its statistical interpretation and its role within Bayesian inference remain poorly understood. This paper presents statistical perspectives on the working principles of several major neural architectures, including feedforward networks, Deep Sets, and Transformers, and examines how these architectures naturally support amortized Bayesian inference. We discuss how these models perform structured approximation and probabilistic reasoning in ways that yield controlled generalization error across a wide range of deployment scenarios, and how these properties can be harnessed for Bayesian computation. Through simulation studies, we evaluate the accuracy, robustness, and uncertainty quantification of amortized inference under varying signal-to-noise ratios and distributional shifts, highlighting both its strengths and its limitations.
Robust low-rank estimation with multiple binary responses using pairwise AUC loss
Multiple binary responses arise in many modern data-analytic problems. Although fitting separate logistic regressions for each response is computationally attractive, it ignores shared structure and can be statistically inefficient, especially in high-dimensional and class-imbalanced regimes. Low-rank models offer a natural way to encode latent dependence across tasks, but existing methods for binary data are largely likelihood-based and focus on pointwise classification rather than ranking performance. In this work, we propose a unified framework for learning with multiple binary responses that directly targets discrimination by minimizing a surrogate loss for the area under the ROC curve (AUC). The method aggregates pairwise AUC surrogate losses across responses while imposing a low-rank constraint on the coefficient matrix to exploit shared structure. We develop a scalable projected gradient descent algorithm based on truncated singular value decomposition. Exploiting the fact that the pairwise loss depends only on differences of linear predictors, we simplify computation and analysis. We establish non-asymptotic convergence guarantees, showing that under suitable regularity conditions, leading to linear convergence up to the minimax-optimal statistical precision. Extensive simulation studies demonstrate that the proposed method is robust in challenging settings such as label switching and data contamination and consistently outperforms likelihood-based approaches.
Sampling via Stochastic Interpolants by Langevin-based Velocity and Initialization Estimation in Flow ODEs
Duan, Chenguang, Jiao, Yuling, Steidl, Gabriele, Wald, Christian, Yang, Jerry Zhijian, Zhang, Ruizhe
We propose a novel method for sampling from unnormalized Boltzmann densities based on a probability-flow ordinary differential equation (ODE) derived from linear stochastic interpolants. The key innovation of our approach is the use of a sequence of Langevin samplers to enable efficient simulation of the flow. Specifically, these Langevin samplers are employed (i) to generate samples from the interpolant distribution at intermediate times and (ii) to construct, starting from these intermediate times, a robust estimator of the velocity field governing the flow ODE. For both applications of the Langevin diffusions, we establish convergence guarantees. Extensive numerical experiments demonstrate the efficiency of the proposed method on challenging multimodal distributions across a range of dimensions, as well as its effectiveness in Bayesian inference tasks.
Structural Dimension Reduction in Bayesian Networks
Heng, Pei, Sun, Yi, Guo, Jianhua
This work introduces a novel technique, named structural dimension reduction, to collapse a Bayesian network onto a minimum and localized one while ensuring that probabilistic inferences between the original and reduced networks remain consistent. To this end, we propose a new combinatorial structure in directed acyclic graphs called the directed convex hull, which has turned out to be equivalent to their minimum localized Bayesian networks. An efficient polynomial-time algorithm is devised to identify them by determining the unique directed convex hulls containing the variables of interest from the original networks. Experiments demonstrate that the proposed technique has high dimension reduction capability in real networks, and the efficiency of probabilistic inference based on directed convex hulls can be significantly improved compared with traditional methods such as variable elimination and belief propagation algorithms. The code of this study is open at \href{https://github.com/Balance-H/Algorithms}{https://github.com/Balance-H/Algorithms} and the proofs of the results in the main body are postponed to the appendix.
Reinforcement Learning for Micro-Level Claims Reserving
Avanzi, Benjamin, Richman, Ronald, Wong, Bernard, Wüthrich, Mario, Xie, Yagebu
Outstanding claim liabilities are revised repeatedly as claims develop, yet most modern reserving models are trained as one-shot predictors and typically learn only from settled claims. We formulate individual claims reserving as a claim-level Markov decision process in which an agent sequentially updates outstanding claim liability (OCL) estimates over development, using continuous actions and a reward design that balances accuracy with stable reserve revisions. A key advantage of this reinforcement learning (RL) approach is that it can learn from all observed claim trajectories, including claims that remain open at valuation, thereby avoiding the reduced sample size and selection effects inherent in supervised methods trained on ultimate outcomes only. We also introduce practical components needed for actuarial use -- initialisation of new claims, temporally consistent tuning via a rolling-settlement scheme, and an importance-weighting mechanism to mitigate portfolio-level underestimation driven by the rarity of large claims. On CAS and SPLICE synthetic general insurance datasets, the proposed Soft Actor-Critic implementation delivers competitive claim-level accuracy and strong aggregate OCL performance, particularly for the immature claim segments that drive most of the liability.
Online Markov Decision Processes with Terminal Law Constraints
Moreno, Bianca Marin, Brégère, Margaux, Gaillard, Pierre, Oudjane, Nadia
Traditional reinforcement learning usually assumes either episodic interactions with resets or continuous operation to minimize average or cumulative loss. While episodic settings have many theoretical results, resets are often unrealistic in practice. The infinite-horizon setting avoids this issue but lacks non-asymptotic guarantees in online scenarios with unknown dynamics. In this work, we move towards closing this gap by introducing a reset-free framework called the periodic framework, where the goal is to find periodic policies: policies that not only minimize cumulative loss but also return the agents to their initial state distribution after a fixed number of steps. We formalize the problem of finding optimal periodic policies and identify sufficient conditions under which it is well-defined for tabular Markov decision processes. To evaluate algorithms in this framework, we introduce the periodic regret, a measure that balances cumulative loss with the terminal law constraint. We then propose the first algorithms for computing periodic policies in two multi-agent settings and show they achieve sublinear periodic regret of order $\tilde O(T^{3/4})$. This provides the first non-asymptotic guarantees for reset-free learning in the setting of $M$ homogeneous agents, for $M > 1$.