Statistical Learning
Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification
Rodemann, Julian, Marquard, Alexander, Augustin, Thomas, Caprio, Michele
Approximate Bayesian inference typically revolves around computing the posterior parameter distribution. In practice, however, the main object of interest is often a model's predictions rather than its parameters. In this work, we propose to bypass the parameter posterior and focus directly on approximating the posterior predictive distribution. We achieve this by drawing inspiration from self-training within self-supervised and semi-supervised learning. Essentially, we quantify a Bayesian model's predictive uncertainty by refitting on self-predicted data. The idea is strikingly simple: if a model assigns high likelihood to self-predicted data, these predictions are of low uncertainty, and vice versa. This yields a deterministic, sampling-free approximation of the posterior predictive. The modular structure of our Self-Supervised Laplace Approximation (SSLA) further allows us to plug in different prior specifications, enabling classical Bayesian sensitivity analysis with respect to the prior choice. To bypass expensive refitting, we further introduce an approximate version of SSLA, called ASSLA. We study (A)SSLA both theoretically and empirically in regression models ranging from Bayesian linear models to Bayesian neural networks. Across a wide array of regression tasks with simulated and real-world datasets, our methods outperform classical Laplace approximations in predictive calibration while remaining computationally efficient.
Multi-Variable Conformal Prediction: Optimizing Prediction Sets without Data Splitting
Lützow, Laura, Garatti, Simone, Campi, Marco C., Lindemann, Lars, Althoff, Matthias
Conformal prediction constructs prediction sets with finite-sample coverage guarantees, but its calibration stage is structurally constrained to a scalar score function and a single threshold variable -- forcing shapes of prediction sets to be fixed before calibration, typically through data splitting. We introduce multi-variable conformal prediction (MCP), a framework that extends conformal prediction to vector-valued score functions with multiple simultaneous calibration variables. Building on scenario theory as a principled framework for certifying data-driven decisions, MCP unifies prediction set design and calibration into a single optimization problem, eliminating data splitting without sacrificing coverage guarantees. We propose two computationally efficient variants: RemMCP, grounded in constrained optimization with constraint removal, which admits a clean generalization of split conformal prediction; and RelMCP, based on iterative optimization with constraint relaxation, which supports non-convex score functions at the cost of possibly greater conservatism. Through numerical experiments on ellipsoidal and multi-modal prediction sets, we demonstrate that RemMCP and RelMCP consistently meet the target coverage with prediction set sizes smaller than or comparable to those of baselines with data split, while considerably reducing variance across calibration runs -- a direct consequence of using all available data for shape optimization and calibration simultaneously.
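For orientation, the split conformal baseline that the abstract generalizes can be sketched as follows. This is a minimal illustration of standard split conformal prediction with a scalar absolute-residual score and a single threshold, not of MCP, RemMCP, or RelMCP; the synthetic data, the linear model, and the function name `split_conformal_interval` are assumptions made for the example.

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
    """Symmetric split conformal intervals from absolute calibration residuals."""
    n = len(residuals_cal)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        q = np.inf
    else:
        q = np.sort(residuals_cal)[k - 1]
    return y_pred_test - q, y_pred_test + q

rng = np.random.default_rng(0)
# Synthetic data (assumption): y = 2x + Gaussian noise.
x = rng.uniform(-1, 1, size=2000)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)
x_train, y_train = x[:500], y[:500]        # used only to fit the model
x_cal, y_cal = x[500:1000], y[500:1000]    # used only to calibrate the threshold
x_test, y_test = x[1000:], y[1000:]

# Least-squares line fitted on the training split.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def predict(t):
    return slope * t + intercept

scores = np.abs(y_cal - predict(x_cal))
lo, hi = split_conformal_interval(scores, predict(x_test), alpha=0.1)
coverage = np.mean((y_test >= lo) & (y_test <= hi))
print(f"empirical coverage: {coverage:.3f}")
```

The interval shape (symmetric around the prediction) is fixed before calibration and only one scalar threshold is tuned on the held-out split, which is the structural constraint the abstract describes.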
Model-based Bootstrap of Controlled Markov Chains
Su, Ziwei, Banerjee, Imon, Klabjan, Diego
We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. We extend bootstrap distributional consistency to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of the Bellman operators, yielding asymptotically valid confidence intervals for value and $Q$-functions. Experiments on the RiverSwim problem show that the proposed bootstrap confidence intervals (CIs), especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs, and are often close to nominal ($50\%$, $90\%$, $95\%$) coverage, while the baselines are poorly calibrated at small sample sizes and short episode lengths.
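A rough sketch of the general idea of bootstrapping a count-based transition estimate is given below. It is a simplified parametric bootstrap that conditions on the observed visitation counts, which is an assumption of this example and not necessarily the resampling scheme analyzed in the paper; the toy chain, the uniform behavior policy, and the function names are likewise hypothetical.

```python
import numpy as np

def estimate_kernel(transitions, n_states, n_actions):
    """Count-based estimate of P(s' | s, a) plus visitation counts N(s, a)."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=2)
    # Unvisited (s, a) pairs get a uniform row; this fallback is an arbitrary choice.
    p_hat = np.full_like(counts, 1.0 / n_states)
    seen = visits > 0
    p_hat[seen] = counts[seen] / visits[seen][:, None]
    return p_hat, visits.astype(int)

def bootstrap_kernels(p_hat, visits, n_boot, rng):
    """Redraw next-state counts from the fitted kernel, keeping N(s, a) fixed."""
    n_states, n_actions, _ = p_hat.shape
    out = np.empty((n_boot,) + p_hat.shape)
    for b in range(n_boot):
        for s in range(n_states):
            for a in range(n_actions):
                n = visits[s, a]
                if n == 0:
                    out[b, s, a] = p_hat[s, a]
                else:
                    out[b, s, a] = rng.multinomial(n, p_hat[s, a]) / n
    return out

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2
# Toy ground-truth kernel and a uniformly random behavior policy (assumptions).
p_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
state, transitions = 0, []
for _ in range(3000):
    action = int(rng.integers(n_actions))
    next_state = int(rng.choice(n_states, p=p_true[state, action]))
    transitions.append((state, action, next_state))
    state = next_state

p_hat, visits = estimate_kernel(transitions, n_states, n_actions)
boot = bootstrap_kernels(p_hat, visits, n_boot=500, rng=rng)
# Percentile confidence interval for one kernel entry, P(s'=1 | s=0, a=0).
lo, hi = np.percentile(boot[:, 0, 0, 1], [2.5, 97.5])
print(f"P_hat = {p_hat[0, 0, 1]:.3f}, 95% percentile CI = [{lo:.3f}, {hi:.3f}]")
```

Confidence intervals for value or $Q$-functions would follow by pushing each bootstrap kernel through policy evaluation and taking percentiles of the resulting values.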
A proximal gradient algorithm for composite log-concave sampling
We propose an algorithm to sample from composite log-concave distributions over $\mathbb{R}^d$, i.e., densities of the form $\pi \propto e^{-f-g}$, assuming access to gradient evaluations of $f$ and a restricted Gaussian oracle (RGO) for $g$. The latter requirement means that we can easily sample from the density $\text{RGO}_{g,h,y}(x) \propto \exp(-g(x) - \frac{1}{2h}\|y-x\|^2)$, which is the sampling analogue of the proximal operator for $g$. If $f + g$ is $\alpha$-strongly convex and $f$ is $\beta$-smooth, our sampler achieves $\varepsilon$ error in total variation distance in $\widetilde{\mathcal O}(\kappa \sqrt d \log^4(1/\varepsilon))$ iterations where $\kappa := \beta/\alpha$, which matches prior state-of-the-art results for the case $g=0$. We further extend our results to cases where (1) $\pi$ is non-log-concave but satisfies a Poincaré or log-Sobolev inequality, and (2) $f$ is non-smooth but Lipschitz.
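To make the analogy between the RGO and the proximal operator concrete, consider the special case $g(x) = \frac{\lambda}{2}\|x\|^2$, chosen here purely for illustration. Completing the square shows that the RGO density is Gaussian with mean $y/(1+\lambda h)$, which is exactly the proximal point of $g$ at $y$ with step size $h$, and covariance $\frac{h}{1+\lambda h} I$. The sketch below checks this numerically; the function names and parameter values are assumptions, and it does not implement the sampler from the abstract.

```python
import numpy as np

def prox_quadratic(y, h, lam):
    """Proximal operator of g(x) = (lam / 2) * ||x||^2 with step size h."""
    return y / (1.0 + lam * h)

def rgo_quadratic(y, h, lam, rng, n_samples=1):
    """Draw from the density proportional to exp(-g(x) - ||y - x||^2 / (2h))."""
    # For quadratic g this is Gaussian: mean = prox, covariance = h / (1 + lam*h) * I.
    mean = prox_quadratic(y, h, lam)
    std = np.sqrt(h / (1.0 + lam * h))
    return mean + std * rng.standard_normal((n_samples, y.size))

rng = np.random.default_rng(0)
y = np.array([1.0, -2.0, 0.5])
h, lam = 0.3, 2.0
samples = rgo_quadratic(y, h, lam, rng, n_samples=200_000)
print("sample mean:", samples.mean(axis=0))
print("prox point :", prox_quadratic(y, h, lam))
```

For a general $g$ no closed form is available, and the abstract simply assumes that such draws can be produced efficiently.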
A Mean Curvature Approach to Boundary Detection: Geometric Insights for Unsupervised Learning
Accurate boundary detection in high-dimensional data remains a central challenge in unsupervised learning, particularly in the presence of non-linear structures and heterogeneous densities. In this work, we introduce Mean Curvature Boundary Points (MCBP), a novel geometric framework grounded in Geometric Machine Learning that departs from traditional density-based approaches by explicitly modeling the intrinsic curvature of the data manifold. The method relies on a discrete approximation of the shape operator, estimated from local k-nearest neighbor patches, to compute pointwise mean curvature without requiring explicit manifold parametrization. The key insight of MCBP is to use mean curvature as a principled descriptor of boundary structure: high-curvature regions naturally correspond to transitions between clusters, geometric irregularities, and low-density interfaces. This yields a unified geometric interpretation of boundary, outlier, and transition points. We further introduce an adaptive percentile-based thresholding scheme that enables multiscale boundary extraction without relying on ad hoc density parameters. Beyond detection, we propose a curvature-driven data decomposition that separates samples into smooth (low-curvature) and boundary (high-curvature) subsets, effectively acting as a non-linear geometric filtering mechanism. This representation enhances cluster separability and improves the robustness of downstream unsupervised algorithms. Extensive experiments on synthetic and real-world datasets demonstrate that MCBP consistently improves clustering performance, particularly in complex and high-dimensional scenarios. These results position MCBP as a concrete contribution to Geometric Machine Learning, highlighting the potential of curvature-aware analysis as a unifying paradigm bridging differential geometry and data-driven modeling.
Fourier Feature Methods for Nonlinear Causal Discovery: FFML Scoring, TRFF Scoring, and FFCI Testing in Mixed Data
Gaussian process (GP) marginal likelihood scores and kernel conditional independence tests are theoretically appealing for nonlinear causal discovery but computationally prohibitive at scale. We present three complementary methods based on random Fourier features (RFF), forming a practical toolkit for score-based, constraint-based, and hybrid causal discovery. The Fourier Feature Marginal Likelihood (FFML) score approximates the exact GP marginal likelihood by replacing the $n \times n$ kernel Gram matrix with a finite-dimensional feature representation, reducing cost to $O(nm^2 + m^3)$ while retaining the probabilistic interpretation and automatic complexity penalty of the exact score. FFML extends to mixed (continuous and discrete) parent sets via a product-kernel construction, with a Kronecker path for small discrete parent sets and a Hadamard-product path otherwise. The Tetrad Random Fourier Feature (TRFF) score is a complementary BIC-style alternative using penalized Student-t regression with random Fourier features. TRFF offers robustness to heavy-tailed noise and faster runtime than FFML. Empirically, TRFF and FFML exhibit a complementary precision-recall profile: TRFF achieves higher precision while FFML achieves better recall and lower structural Hamming distance (SHD) overall. The Fourier Feature Conditional Independence (FFCI) test is a fast nonparametric CI test for mixed data, using ridge residualization in feature space and a Frobenius-norm cross-covariance statistic approximated as a weighted sum of chi-squared variables. Empirically, BOSS+FFML achieves the lowest SHD on nonlinear data, while BOSS+TRFF offers the highest precision. When run through PC-Max, FFCI and RCIT exhibit complementary precision-recall profiles: RCIT is more precise while FFCI achieves better recall and substantially lower SHD, at approximately twice the runtime.
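The building block shared by these methods, replacing an $n \times n$ Gram matrix with an explicit feature map, can be sketched with the standard random Fourier feature construction for the RBF kernel. This is a generic illustration and not the FFML, TRFF, or FFCI implementation; the kernel choice, lengthscale, feature count, and function names are assumptions.

```python
import numpy as np

def rff_features(X, n_features, lengthscale, rng):
    """Random Fourier features Z such that Z @ Z.T approximates the RBF Gram matrix."""
    d = X.shape[1]
    # Frequencies are drawn from the RBF spectral density N(0, lengthscale^-2 * I).
    W = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def rbf_kernel(X, lengthscale):
    """Exact Gram matrix for k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Z = rff_features(X, n_features=4000, lengthscale=1.5, rng=rng)
K_exact = rbf_kernel(X, lengthscale=1.5)
max_err = np.max(np.abs(Z @ Z.T - K_exact))
print(f"max abs error of the RFF Gram approximation: {max_err:.3f}")
```

With an $n \times m$ feature matrix in hand, downstream linear algebra can work with $m \times m$ systems built from `Z.T @ Z` instead of $n \times n$ ones, which is where a cost of the form $O(nm^2 + m^3)$ typically comes from.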
A Refined Generalization Analysis for Extreme Multi-class Supervised Contrastive Representation Learning
Hieu, Nong Minh, Ledent, Antoine
Contrastive Representation Learning (CRL) has achieved strong empirical success in multiple machine learning disciplines, yet its theoretical sample complexity remains poorly understood. Existing analyses usually assume that input tuples are independent and identically distributed, an assumption violated in most practical settings where contrastive tuples are constructed from a finite pool of labeled data, inducing dependencies among tuples. While one recent work analyzed this learning setting using U-Statistics to estimate the population risk, the techniques used therein require the risk of each class to concentrate uniformly, making excess risk bounds scale in the order of $\rho_{\min}^{-1/2}$ where $\rho_{\min}$ denotes the probability of the rarest class. Such a dependency can be overly pessimistic in extreme multi-class settings where there are many tail classes which contribute minimally to the overall population risk. Our contributions are two-fold. Firstly, we improve upon the previous work and prove a bound with a sample complexity of the same order as the number of classes $R$, regardless of the distribution over classes. Furthermore, we formulate a different estimator that captures the concentration of the risk \textit{across classes}, enabling sharper bounds in extreme multi-class learning scenarios, especially where class distributions are long-tailed. Under mild assumptions on the class distributions, the resulting sample complexity is $\mathcal{O}(k)$ where $k$ is the number of samples per tuple.
Path-Based Gradient Boosting for Graph-Level Prediction
Meggio, Claudio, Pensar, Johan, De Bin, Riccardo
We propose PathBoost, a gradient tree boosting method for graph-level classification and regression that learns discriminative path-based features directly from the input graph structure. Building on previous work, which was tailored to a specific chemistry application, PathBoost introduces three key extensions: (i) adaptation to binary classification through gradient boosting with a logistic loss, (ii) incorporation of multiple node and edge attributes into the path feature space via a prefix-based decomposition, and (iii) automatic anchor node selection based on categorical attribute diversity, eliminating the need for the user to specify the starting point of the considered path features. We compare PathBoost to graph neural networks and graph kernel approaches on several benchmark datasets, obtaining better results on half of them and comparable results on the rest. PathBoost performs better on graphs with larger average node counts. Overall, the results demonstrate that path-based boosting methods can be competitive with more complex black-box approaches.
On Observation Time for Recovering Latent Hawkes Networks
Linkerhägner, Jonas, Bortolasi, Michele, Baldassari, Lorenzo, de Hoop, Maarten V., Dokmanić, Ivan
Dynamics of interacting systems in engineering, society, and nature often evolve over latent networks that govern which entities can interact. We study the problem of inferring these networks from event-based observations, which arise naturally in finance, seismology, and neuroscience. While there is substantial algorithmic work addressing this important problem, theoretical results are scarce. In this paper we ask the following fundamental question: what is the minimum time that one must observe the dynamics in order to exactly recover the underlying network, as a function of the number $d$ of interacting entities? For a class of stationary Hawkes processes with sparse, weak interactions, we prove that an observation time of order $\log d$ is sufficient and necessary. For the upper bound we construct a two-stage estimator that uses clipped and binned event data for screening, followed by a least-squares refinement, and apply concentration bounds derived from the Poisson cluster representation. For the lower bound we combine Fano's inequality with Jacod's Girsanov formula for point processes on a suitable subclass of networks.
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
Vaidya, Omatharv Bharat, Jerzak, Connor T., Ho, Nhat, Bajaj, Chandrajit
We present a data-adaptive method for parameter-efficient fine-tuning of large neural networks. Standard low-rank adaptation methods improve efficiency by restricting each layer update to a fixed low-rank form, but this static parameterization can be too rigid when the appropriate correction depends on the input and on the evolving depth-wise computation of the network. Our approach replaces a purely layer-local adapter with a shared queryable memory of low-rank update atoms. For each block of layers, the model forms a query from the current low-rank state and a running summary of previous blocks, uses this query to retrieve a content-dependent combination of shared update components via attention, and applies the resulting routed operator within the low-rank bottleneck. In this way, the method retains the efficiency and scalability of low-rank adaptation while allowing the effective update to vary across inputs and to share reusable structure across layers. The resulting architecture provides a principled middle ground between static LoRA-style updates and fully generated parameter updates: it remains compact and parameter-efficient while supporting dynamic, context-sensitive adaptation. Further, we incorporate instruction-regularization by augmenting routing logits with a language-induced prior over update atoms, thereby biasing the selection of low-rank transformations toward semantically relevant directions without generating unconstrained parameter updates. Experiments on noisy non-linear regression tasks and LLM fine-tuning suggest that this queryable update-memory formulation can improve final test performance and training stability compared to standard low-rank adaptation, while using a comparable number of trainable parameters.
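One possible reading of the routing step described above is sketched below in NumPy for a single input vector. Every shape, name, and design detail here (the `atoms` tensor, the `keys` matrix, the query projection `W_q`, the scaled dot-product logits, and the additive `prior_logits`) is an assumption made for illustration; the abstract does not specify the actual parameterization.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def routed_lora_update(x, summary, A, B, atoms, keys, W_q, prior_logits=None):
    """Low-rank update whose inner r x r operator is an attention-weighted mix of shared atoms."""
    z = A @ x                                        # low-rank state, shape (r,)
    query = W_q @ np.concatenate([z, summary])       # query from state and running summary
    logits = keys @ query / np.sqrt(query.size)      # one logit per shared atom, shape (K,)
    if prior_logits is not None:
        logits = logits + prior_logits               # hypothetical instruction-induced prior
    weights = softmax(logits)
    routed = np.tensordot(weights, atoms, axes=1)    # weighted sum of atoms, shape (r, r)
    return B @ (routed @ z), weights                 # update in model dimension, shape (d,)

rng = np.random.default_rng(0)
d, r, n_atoms, q_dim, s_dim = 8, 4, 5, 6, 3          # toy sizes (assumptions)
A = rng.normal(size=(r, d))                          # down-projection
B = rng.normal(size=(d, r))                          # up-projection
atoms = rng.normal(size=(n_atoms, r, r))             # shared memory of update atoms
keys = rng.normal(size=(n_atoms, q_dim))
W_q = rng.normal(size=(q_dim, r + s_dim))
x = rng.normal(size=d)
summary = rng.normal(size=s_dim)                     # stand-in for the summary of previous blocks

update, weights = routed_lora_update(x, summary, A, B, atoms, keys, W_q)
print("routing weights:", np.round(weights, 3))
```

Because only the mixing weights depend on the input, the trainable parameters stay limited to the projections, the shared atoms, and the routing components, while the effective update can differ from one input to the next.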