Statistical Learning
When to Trust Confidence Thresholding: Calibration Diagnostics for Pseudo-Labelled Regression
Calibrated probability outputs of trained classifiers are increasingly used as inputs to downstream regression estimands such as effects, prevalences, or disparities for a latent group observed only on a small labelled subset. A standard practice is to threshold the calibrated score at a confidence cutoff and treat the hard label as the truth. Building on a recent identification result for the underlying moment equation, we develop a calibration-aware diagnostic apparatus for pseudo-labelling pipelines. We derive a closed-form expression for the attenuation bias that confidence thresholding induces in the downstream regression coefficient, and show that the bias can be predicted, before any inference is run, from the residual score variance $V^{*}=\mathbb{E}[\operatorname{Var}(p\mid X)]$ on the unlabelled set after partialling out the downstream controls $X$. We further obtain a sharp sensitivity bound under bounded calibration drift, and identify the boundary $V^{*}=0$, which holds iff $p$ is a deterministic function of $X$; this motivates a structural separation between classifier features $W$ and downstream controls $X\subsetneq W$. Five controlled simulations and a UCI Adult illustration trace the predictions. The contribution is operational: a $(V^{*}, ฮบ)$ decision rule that practitioners can compute from any classifier output to decide whether confidence thresholding is safe.
Robust Sequential Experimental Design for A/B Testing
Wen, Qianglin, Wu, Xiangkun, Shi, Chengchun, Li, Ting, Tang, Niansheng, Zhang, Yingying, Zhu, Hongtu
Experimental design has emerged as a powerful approach for improving the sample efficiency of A/B testing, yet existing designs rely critically on correctly specified models. We study robust sequential experimental design under model misspecification and develop a unified framework that covers both contextual bandit and dynamic settings. Theoretically, we prove that our design bounds the worst-case mean squared error of the estimated treatment effect. Empirically, we demonstrate the effectiveness of the proposed approach using synthetic and real-world datasets from a leading technology company.
Enhancing a Risk Model by Adding Transient Statistical Factors
Tzikas, Alexandros E., Candรจs, Emmanuel J., Hastie, Trevor, Boyd, Stephen P., Kochenderfer, Mykel J., Kahn, Ronald N.
Estimating the covariance of asset returns, i.e., the risk model, is a key component of financial portfolio construction and evaluation. Most risk modeling approaches produce a factor model that decomposes the asset variability into two components: the first attributed to a small number of factors that are common among the assets and the second attributed to the idiosyncratic behavior of each asset. Third-party providers typically provide risk models to investors, and while these models are typically of high quality, they may fail to capture important information, e.g., changing market regimes and transient factors. To overcome these limitations, we propose a systematic method based on maximum likelihood estimation to enhance an existing factor model by both refining the given model and adding new statistical factors. Our approach relies only on the observed sequence of realized returns and on the choice of two hyperparameters: the number of additional factors and the half-life parameter that determines the weights assigned to returns in the log-likelihood objective. Importantly, our methodology applies to the situation where asset returns may be missing, making it suitable for typical equity datasets. We demonstrate our approach on the Barra short-term US risk model, a high-quality risk model used in practice, for a universe of US high-capitalization equities. We show that the proposed extension captures structure in the returns that is missed by the original model.
State-of-art minibatches via novel DPP kernels: discretization, wavelets, and rough objectives
Tran, Hoang-Son, Gupta, Pranav, Bardenet, Rรฉmi, Ghosh, Subhroshekhar
Determinantal point processes (DPPs) have emerged as a kernelized alternative to vanilla independent sampling for generating efficient minibatches, coresets and other parsimonious representations of large-scale datasets. While theoretical foundations and promising empirical performance have been demonstrated, there are two challenges for current proposals for DPP-based coresets or minibatches. The first is the need for families of DPPs with certain key variance reduction properties, usually constructed in a continuous setting, of which there are few known examples. The second is the need for an ad-hoc construction of a discrete DPP defined on a given dataset, that inherits such variance reduction. In this work, we contribute to the programme of establishing DPPs as a subsampling toolbox for ML by advancing on these two fronts. First, we propose new DPPs on the Euclidean space based on wavelets, with provably better accuracy guarantees than the best known rates. Second, we introduce a general method to convert such continuous DPPs, which are more amenable to proving analytical statements, into discrete kernels, which are pertinent for subsampling tasks such as minibatch and coreset constructions. This conversion mechanism simultaneously preserves the desired variance decay and reveals a low-rank decomposition of the discrete kernel, which makes sampling the corresponding DPP computationally inexpensive. En route, we enlarge the class of ML tasks amenable to improvements via DPP-based minibatches and coresets to include objective functions with arbitrarily low regularity, and rate guarantees that explicitly adapt to this regularity.
Amortized Neural Clustering of Time Series based on Statistical Features
Lรณpez-Oriona, รngel, Sun, Ying
This paper introduces an algorithm-agnostic approach to feature-based time series clustering via amortized neural inference. By training neural networks to approximate the optimal partitioning rule from simulated data, the proposed framework reduces reliance on conventional clustering methods, such as $K$-means, $K$-medoids, or hierarchical clustering, and their associated objective functions and heuristics. Leveraging statistical features, such as autocorrelations and quantile autocorrelations, the approach learns a data-driven affinity structure from which clustering partitions can be recovered, without requiring explicit prior specification of cluster shapes or structures. In addition, one version of the method can automatically determine the number of clusters, avoiding ad-hoc selection procedures. Comprehensive empirical studies show that the proposed framework achieves competitive or superior clustering accuracy relative to traditional methods, even in challenging scenarios where competing techniques are provided with the true number of clusters. An application to financial time series of stock returns illustrates its practical utility. By reducing the need for algorithm selection and calibration, the proposed framework opens new possibilities for automated, adaptive, and data-driven clustering of temporal data across scientific and industrial domains.
Kernel-based guarantees for nonlinear parametric models in Bayesian optimization
Modern Bayesian optimization and adaptive sampling methods increasingly rely on nonlinear parametric models, yet theoretical guarantees for such models under adaptive data collection remain limited. Existing analyses largely focus on Gaussian processes, kernel machines, linear models, or linearized neural approximations, leaving a gap between theory and the nonlinear models used in practice. We develop a kernel-based framework for analyzing regularized nonlinear parametric models trained on adaptively collected data. Our approach uses kernels over the parameter space to induce reproducing-kernel Hilbert space structures over the corresponding model class, yielding confidence bounds for models trained with broad classes of regularized convex losses. We show how these bounds can support convergence guarantees for nonlinear acquisition and surrogate models, including randomized regularized policies that select points by maximizing a trained random model. These results provide a unified route to analyzing nonlinear parametric models in Bayesian optimization and related adaptive optimization settings.
Coupling-Informed Transport Maps for Bayesian Filtering in Nonlinear Dynamical Systems
Zeng, Dengfei, Jiang, Lijian, Sun, Shuyu, Xiao, Dunhui
A likelihood-free transport filtering method is proposed based on the couplings between state and observation variables. By exploiting a block-triangular structure in the transport map, the analysis step of filtering is reformulated as the minimization of the maximum mean discrepancy (MMD) between the true joint measure and its transport-based approximation. To circumvent the non-convexity in the MMD optimization, we introduce a training-free transport filter method via gradient flows, which leads to an analytic computation for the transport map that implies the steepest descent direction of the MMD. The proposed approach accurately approximates non-Gaussian filtering posteriors and avoids particle collapse. We provide a convergence analysis for the expectation of the MMD between the approximated posterior and the truth posterior. Finally, we extend the method to high-dimensional problems through domain localization. Numerical examples demonstrate the superior performance of our approach over conventional filtering methods in nonlinear, non-Gaussian scenarios.
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
Mahran, Ammar, Maranjyan, Artavazd, Richtรกrik, Peter
Asynchronous stochastic gradient descent (ASGD) is a standard way to exploit heterogeneous compute resources in distributed learning: instead of forcing fast workers to wait for slow ones, the server updates the model whenever a gradient arrives. Vanilla ASGD applies each arriving gradient with the same weight. When local data distributions are heterogeneous, this becomes problematic: faster workers contribute more updates, and we show theoretically that the method is biased toward a frequency-weighted average of the local objectives rather than the desired global objective. Existing remedies typically move away from the simple ASGD template by introducing gathering phases, buffering, or extra memory. We show that this is unnecessary. Keeping the standard ASGD mechanism, we recover the correct objective by rescaling worker-specific stepsizes in proportion to their computation times, so that each worker contributes the same aggregate learning rate over a cycle. In the non-convex setting, under smoothness and bounded heterogeneity assumptions, we prove that the resulting method, Rescaled ASGD, converges to stationary points of the correct global objective in the fixed-computation model. Its time complexity matches the known lower bound in the leading term, while the effects of staleness and data heterogeneity appear only in lower-order terms. Experiments confirm that the method converges to the correct objective and is competitive with state-of-the-art baselines.
Sensor Design for Accuracy-Bounded Estimation via Maximum-Entropy Likelihood Synthesis
Designing the sensing architecture for large-scale spatio-temporal systems is hard when accuracy requirements are specified but sensor models are uncertain or unavailable. Classical design treats sensor placement and estimation sequentially, requiring valid forward models for each sensing modality. This paper inverts the design flow: given an error budget, synthesize the measurement likelihood that enforces it while injecting minimal information beyond the dynamical prior. The likelihood is constructed by constrained optimization: among all posteriors satisfying a prescribed accuracy bound relative to a target, select the one minimizing Kullback-Leibler divergence from the prior. The solution is a maximum-entropy posterior in relative-entropy form, and the induced likelihood is the Radon-Nikodym derivative. The framework accommodates arbitrary discrepancies and is instantiated for Wasserstein distance, maximum mean discrepancy, $f$-divergences, moment constraints, and hybrid metrics. For each, we derive the discrete particle-level problem, analyze its convex or convex-relaxed structure, and present solvers with complexity scaling. A closed-form solution exists for the symmetric exponential-tilt case, and a distillation procedure converts nonparametric likelihood samples into parametric forms. A two-layer sensor design architecture embeds the synthesized likelihood in the recursive predict-update loop, connecting accuracy budgets to physical sensor placement, precision, and configuration. Numerical experiments comparing four metrics on unimodal and multimodal scenarios confirm the accuracy constraints are reliably enforced and reveal how metric choice determines the amount and spatial distribution of injected information.
Interpretable Machine Learning for Spatial Science: A Lie-Algebraic Kernel for Rotationally Anisotropic Gaussian Processes
Warrior, Kane, Chakrabarty, Dalia
Many three-dimensional spatial fields are anisotropic, with directions of rapid and slow variation that need not align with the coordinate axes. Standard Gaussian process kernels with Automatic Relevance Determination (ARD) capture only axis-aligned anisotropy, while generic full symmetric positive definite (SPD) metrics can represent rotated anisotropy but do not parameterise principal length-scales and directions directly. We introduce an interpretable rotationally anisotropic GP kernel that parameterises a three-dimensional SPD covariance metric using three principal length-scales and an explicit SO(3) rotation. The rotation is represented by an axis-angle vector and mapped to SO(3) via the Lie-algebra exponential map, giving unconstrained Euclidean coordinates for inference while always inducing a valid SPD metric. The construction spans the same family of three-dimensional SPD covariance metrics as a generic full-SPD parameterisation, but exposes the geometry differently: length-scales and orientation are explicit, interpretable, and directly available for prior specification and posterior summaries. We perform Bayesian inference on these quantities using Markov Chain Monte Carlo (MCMC), and characterise the resulting symmetries and weakly identified regimes. On synthetic data with rotated anisotropy, the posterior recovers the generating metric and improves prediction relative to an axis-aligned ARD baseline, while matching the predictive performance of a generic full SPD baseline. When the ground truth is axis-aligned, posterior mass concentrates near the identity rotation and predictive performance matches ARD. On a material-density dataset from a laboratory-fabricated nano-brick, the inferred metric reveals rotated anisotropy that is not captured by axis-aligned kernels.