Statistical Learning
Isomorphic Functionalities between Ant Colony and Ensemble Learning: Part II-On the Strength of Weak Learnability and the Boosting Paradigm
Fokoué, Ernest, Babbitt, Gregory, Levental, Yuval
In Part I of this series, we established a rigorous mathematical isomorphism between ant colony decision-making and random forest learning, demonstrating that variance reduction through decorrelation is a universal principle shared by biological and computational ensembles. Here we turn to the complementary mechanism: bias reduction through adaptive weighting. Just as boosting algorithms sequentially focus on difficult instances, ant colonies dynamically amplify successful foraging paths through pheromone-mediated recruitment. We prove that these processes are mathematically isomorphic, establishing that the fundamental theorem of weak learnability has a direct analog in colony decision-making. We develop a formal mapping between AdaBoost's adaptive reweighting and ant recruitment dynamics, show that the margin theory of boosting corresponds to the stability of quorum decisions, and demonstrate through comprehensive simulation that ant colonies implementing adaptive recruitment achieve the same bias-reduction benefits as boosting algorithms. This completes a unified theory of ensemble intelligence, revealing that both variance reduction (Part I) and bias reduction (Part II) are manifestations of the same underlying mathematical principles governing collective intelligence in biological and computational systems.
Convergence of projected stochastic natural gradient variational inference for various step size and sample or batch size schedules
Guilmeau, Thomas, Hendrikx, Hadrien, Forbes, Florence
Stochastic natural gradient variational inference (NGVI) is a popular and efficient algorithm for Bayesian inference. Despite empirical success, the convergence of this method is still not fully understood. In this work, we define and study a projected stochastic NGVI when variational distributions form an exponential family. Stochasticity arises when either gradients are intractable expectations or large sums. We prove new non-asymptotic convergence results for combinations of constant or decreasing step sizes and constant or increasing sample/batch sizes. When all hyperparameters are fixed, NGVI is shown to converge geometrically to a neighborhood of the optimum, while we establish convergence to the optimum with rates of the form $\mathcal{O}\left(\frac{1}{T^ρ} \right)$, possibly with $ρ\geq 1$, for all other combinations of step size and sample/batch size schedules. These rates apply when the target posterior distribution is close in some sense to the considered exponential family. Our theoretical results extend existing NGVI and stochastic optimization results and provide more flexibility to adjust, in a principled way, step sizes and sample/batch sizes in order to meet speed, resources, or accuracy constraints.
Aligning Validation with Deployment: Target-Weighted Cross-Validation for Spatial Prediction
Brenning, Alexander, Suesse, Thomas
Cross-validation (CV) is commonly used to estimate predictive risk when independent test data are unavailable. Its validity depends on the assumption that validation tasks are sampled from the same distribution as prediction tasks encountered during deployment. In spatial prediction and other settings with structured data, this assumption is frequently violated, leading to biased estimates of deployment risk. We propose Target-Weighted CV (TWCV), an estimator of deployment risk that accounts for discrepancies between validation and deployment task distributions, thus accounting for (1) covariate shift and (2) task-difficulty shift. We characterize prediction tasks by descriptors such as covariates and spatial configuration. TWCV assigns weights to validation losses such that the weighted empirical distribution of validation tasks matches the corresponding distribution over a target domain. The weights are obtained via calibration weighting, yielding an importance-weighted estimator that targets deployment risk. Since TWCV requires adequate coverage of the deployment distribution's support, we combine it with spatially buffered resampling that diversifies the task difficulty distribution. In a simulation study, conventional as well as spatial estimators exhibit substantial bias depending on sampling, whereas buffered TWCV remains approximately unbiased across scenarios. A case study in environmental pollution mapping further confirms that discrepancies between validation and deployment task distributions can affect performance assessment, and that buffered TWCV better reflects the prediction task over the target domain. These results establish task distribution mismatch as a primary source of CV bias in spatial prediction and show that calibration weighting combined with a suitable validation task generator provides a viable approach to estimating predictive risk under dataset shift.
Unbounded Density Ratio Estimation and Its Application to Covariate Shift Adaptation
Liu, Ren-Rui, Fan, Jun, Shi, Lei, Guo, Zheng-Chu
This paper focuses on the problem of unbounded density ratio estimation -- an understudied yet critical challenge in statistical learning -- and its application to covariate shift adaptation. Much of the existing literature assumes that the density ratio is either uniformly bounded or unbounded but known exactly. These conditions are often violated in practice, creating a gap between theoretical guarantees and real-world applicability. In contrast, this work directly addresses unbounded density ratios and integrates them into importance weighting for effective covariate shift adaptation. We propose a three-step estimation method that leverages unlabeled data from both the source and target distributions: (1) estimating a relative density ratio; (2) applying a truncation operation to control its unboundedness; and (3) transforming the truncated estimate back into the standard density ratio. The estimated density ratio is then employed as importance weights for regression under covariate shift. We establish rigorous, non-asymptotic convergence guarantees for both the proposed density ratio estimator and the resulting regression function estimator, demonstrating optimal or near-optimal convergence rates. Our findings offer new theoretical insights into density ratio estimation and learning under covariate shift, extending classical learning theory to more practical and challenging scenarios.
Kinetic Langevin Splitting Schemes for Constrained Sampling
Constrained sampling is an important and challenging task in computational statistics, concerned with generating samples from a distribution under certain constraints. There are numerous types of algorithm aimed at this task, ranging from general Markov chain Monte Carlo, to unadjusted Langevin methods. In this article we propose a series of new sampling algorithms based on the latter of these, specifically the kinetic Langevin dynamics. Our series of algorithms are motivated on advanced numerical methods which are splitting order schemes, which include the BU and BAO families of splitting schemes.Their advantage lies in the fact that they have favorable strong order (bias) rates and computationally efficiency. In particular we provide a number of theoretical insights which include a Wasserstein contraction and convergence results. We are able to demonstrate favorable results, such as improved complexity bounds over existing non-splitting methodologies. Our results are verified through numerical experiments on a range of models with constraints, which include a toy example and Bayesian linear regression.
Koopman Operator Identification of Model Parameter Trajectories for Temporal Domain Generalization (KOMET)
Hoover, Randy C., James, Jacob, May, Paul, Caudle, Kyle
Parametric models deployed in non-stationary environments degrade as the underlying data distribution evolves over time (a phenomenon known as temporal domain drift). In the current work, we present KOMET (Koopman Operator identification of Model parameter Evolution under Temporal drift), a model-agnostic, data-driven framework that treats the sequence of trained parameter vectors as the trajectory of a nonlinear dynamical system and identifies its governing linear operator via Extended Dynamic Mode Decomposition (EDMD). A warm-start sequential training protocol enforces parameter-trajectory smoothness, and a Fourier-augmented observable dictionary exploits the periodic structure inherent in many real-world distribution drifts. Once identified, KOMET's Koopman operator predicts future parameter trajectories autonomously, without access to future labeled data, enabling zero-retraining adaptation at deployment. Evaluated on six datasets spanning rotating, oscillating, and expanding distribution geometries, KOMET achieves mean autonomous-rollout accuracies between 0.981 and 1.000 over 100 held-out time steps. Spectral and coupling analyses further reveal interpretable dynamical structure consistent with the geometry of the drifting decision boundary.
Enhancing Online Support Group Formation Using Topic Modeling Techniques
Barman, Pronob Kumar, Reynolds, Tera L., Foulds, James
Online health communities (OHCs) are vital for fostering peer support and improving health outcomes. Support groups within these platforms can provide more personalized and cohesive peer support, yet traditional support group formation methods face challenges related to scalability, static categorization, and insufficient personalization. To overcome these limitations, we propose two novel machine learning models for automated support group formation: the Group specific Dirichlet Multinomial Regression (gDMR) and the Group specific Structured Topic Model (gSTM). These models integrate user generated textual content, demographic profiles, and interaction data represented through node embeddings derived from user networks to systematically automate personalized, semantically coherent support group formation. We evaluate the models on a large scale dataset from MedHelp, comprising over 2 million user posts. Both models substantially outperform baseline methods including LDA, DMR, and STM in predictive accuracy (held out log likelihood), semantic coherence (UMass metric), and internal group consistency. The gDMR model yields group covariates that facilitate practical implementation by leveraging relational patterns from network structures and demographic data. In contrast, gSTM emphasizes sparsity constraints to generate more distinct and thematically specific groups. Qualitative analysis further validates the alignment between model generated groups and manually coded themes, showing the practical relevance of the models in informing groups that address diverse health concerns such as chronic illness management, diagnostic uncertainty, and mental health. By reducing reliance on manual curation, these frameworks provide scalable solutions that enhance peer interactions within OHCs, with implications for patient engagement, community resilience, and health outcomes.
Parameter Estimation in Stochastic Differential Equations via Wiener Chaos Expansion and Stochastic Gradient Descent
Delgado-Vences, Francisco, Pavón-Español, José Julián, Ornelas, Arelly
This study addresses the inverse problem of parameter estimation for Stochastic Differential Equations (SDEs) by minimizing a regularized discrepancy functional via Stochastic Gradient Descent (SGD). To achieve computational efficiency, we leverage the Wiener Chaos Expansion (WCE), a spectral decomposition technique that projects the stochastic solution onto an orthogonal basis of Hermite polynomials. This transformation effectively maps the stochastic dynamics into a hierarchical system of deterministic functions, termed the \textit{propagator}. By reducing the stochastic inference task to a deterministic optimization problem, our framework circumvents the heavy computational burden and sampling requirements of traditional simulation-based methods like MCMC or MLE. The robustness and scalability of the proposed approach are demonstrated through numerical experiments on various non-linear SDEs, including models for individual biological growth. Results show that the WCE-SGD framework provides accurate parameter recovery even from discrete, noisy observations, offering a significant paradigm shift in the efficient modeling of complex stochastic systems.
Calorimeter Shower Superresolution with Conditional Normalizing Flows: Implementation and Statistical Evaluation
In High Energy Physics, detailed calorimeter simulations and reconstructions are essential for accurate energy measurements and particle identification, but their high granularity makes them computationally expensive. Developing data-driven techniques capable of recovering fine-grained information from coarser readouts, a task known as calorimeter superresolution, offers a promising way to reduce both computational and hardware costs while preserving detector performance. This thesis investigates whether a generative model originally designed for fast simulation can be effectively applied to calorimeter superresolution. Specifically, the model proposed in arXiv:2308.11700 is re-implemented independently and trained on the CaloChallenge 2022 dataset based on the Geant4 Par04 calorimeter geometry. Finally, the model's performance is assessed through a rigorous statistical evaluation framework, following the methodology introduced in arXiv:2409.16336, to quantitatively test its ability to reproduce the reference distributions.
A Hierarchical Sheaf Spectral Embedding Framework for Single-Cell RNA-seq Analysis
Wang, Xiang Xiang, We, Guo-Wei
Single-cell RNA-seq data analysis typically requires representations that capture heterogeneous local structure across multiple scales while remaining stable and interpretable. In this work, we propose a hierarchical sheaf spectral embedding (HSSE) framework that constructs informative cell-level features based on persistent sheaf Laplacian analysis. Starting from scale-dependent low-dimensional embeddings, we define cell-centered local neighborhoods at multiple resolutions. For each local neighborhood, we construct a data-driven cellular sheaf that encodes local relationships among cells. We then compute persistent sheaf Laplacians over sampled filtration intervals and extract spectral statistics that summarize the evolution of local relational structure across scales. These spectral descriptors are aggregated into a unified feature vector for each cell and can be directly used in downstream learning tasks without additional model training. We evaluate HSSE on twelve benchmark single-cell RNA-seq datasets covering diverse biological systems and data scales. Under a consistent classification protocol, HSSE achieves competitive or improved performance compared with existing multiscale and classical embedding-based methods across multiple evaluation metrics. The results demonstrate that sheaf spectral representations provide a robust and interpretable approach for single-cell RNA-seq data representation learning.