Genre
Partially Performative Prediction
Performative prediction studies feedback loops that arise when predictive models are deployed in consequential domains. In these settings, deploying a model can change the population whose patterns the model aims to predict, inducing a distribution shift that is endogenous to the learning system. This perspective departs from classical treatments of distribution shift, where shifts are typically modeled as exogenous changes in the data-generating process. Yet, in practice, distribution shift is rarely one or the other. Predictive models may influence future data through the decisions they support, while the world itself continues to drift for reasons beyond the learner's control. We study partially performative prediction, a framework that captures both endogenous and exogenous sources of distribution shift. The framework generalizes performative prediction by allowing the data distribution to evolve both in response to the deployed model and according to an external, time-varying process. We extend the central notions of performative stability and performative optimality to this setting by defining their online analogues that track the evolving partially performative environment. We analyze practical learning heuristics, including repeated retraining, and characterize when they successfully adapt to partially performative environments.
Fast and Robust Convergence Rate for TD(0) with Linear Function Approximation, Universal Learning Steps and I.I.D. Samples
Kobeissi, Ziad, Berthier, Éloïse
In this paper, we study the finite-time behavior of the TD(0) temporal-difference method with linear function approximation (LFA). We consider on-policy independent and identically distributed (i.i.d.) samples, a constant learning step, and the Polyak-Juditsky averaging method. We establish a new convergence rate, for the Mean-Square Error (MSE) on the approximated function, that is (i) fast in the sense that it admits an optimal dependency in the number of iterations k (i.e., of order 1/k), (ii) robust to ill-conditioning: it only depends on an initial error and modelindependent constants and (iii) sharp up to a multiplicative constant lower than 11. In particular, it does not depend on the smallest eigenvalue of the uncentered covariance matrix of the linear parametrization, unlike all pre-existing O(1/k) rates in the TD(0) literature. We also introduce PCTD(0), a variant of TD(0), which benefits from better convergence properties under an additional assumption of strong mixing on the Markov Chain.
Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods
Zhou, Junyu, Wang, Puyu, Lei, Yunwen, Kloft, Marius, Ying, Yiming
A central mystery in deep learning is how neural networks, despite being highly non-convex and heavily overparameterized, are able to achieve near-zero training error while still generalizing well to unseen data. This paradox has sparked a surge of research aimed at understanding the convergence and generalization behavior of neural networks [1, 2, 6, 7, 15, 38, 41, 49]. The Neural Tangent Kernel (NTK), introduced by [20], has become one of a foundational tool for understanding the behavior of training dynamics for neural networks, especially those trained using gradient-based methods such as gradient descent (GD) and stochastic gradient descent (SGD). The core idea here is to linearize the neural network around its random initialization, which enables the evolution of the network during training to be closely approximated by a kernel method associated with the corresponding NTK. This framework establishes a powerful connection between the evolution of a neural network during training process and the behavior of kernel methods in a reproducing kernel Hilbert space (RKHS) induced by the NTK, allowing insights from the kernel methods to inform our understanding of neural networks. Following this perspective, the influential work [34] showed that for regression problems, shallow neural networks trained by SGD can achieve generalization performance on par with their kernel counterparts.
The Effect of Training Task Diversity on In-Context Learning through the Lens of Low-Dimensional Subspaces
Kwon, Soo Min, Xu, Alec S., Yaras, Can, Song, Dogyoon, Balzano, Laura, Qu, Qing
The transformer's emergent ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its underlying mechanisms. Existing works often study how training task diversity, defined either as the number of ICL training task vectors or as the number of function classes from which the task vectors are drawn, shapes both the learning dynamics and generalization capabilities of ICL. While both definitions have uncovered many interesting phenomena, many observations under the latter definition remain theoretically unexplained. This paper presents a minimal analytical model under which these phenomena provably emerge from the properties of the training data. By modeling the training task vectors as a mixture of low-rank Gaussians, we show how training task diversity, defined by the number of non-overlapping columns between subspaces that parameterize the covariance matrices, improves both the generalization and optimization trajectory of ICL with linear attention. In particular, we show that our model can explain (i) why training with task diversity shortens the ICL plateau and (ii) why ICL appears to achieve out-of-distribution generalization. We conclude by empirically demonstrating how our results extend to nonlinear transformers and nonlinear function classes. Overall, our work presents a tractable framework to unify existing observations.
Stability beyond Bounded Differences: Sharp Generalization Bounds under Finite $L_p$ Moments
Lei, Qianqian, Bonnerjee, Soham, Han, Yuefeng, Wu, Wei Biao
While algorithmic stability is a central tool for understanding generalization of learning algorithms, existing high-probability guarantees typically rely on uniform boundedness or sub-Gaussian/sub-Weibull tail assumptions, which can be overly restrictive for modern settings with heavy-tailed or unbounded losses. We develop a stability-based framework that requires only a finite $L_p$ moment condition. Our first contribution is sharp concentration inequalities for functions of independent random variables under $L_p$ constraints, extending McDiarmid's bounded-differences techniques beyond the classical regime. Leveraging these results, we derive sharp high-probability generalization bounds across a range of learning paradigms, including empirical risk minimization, transductive regression, and meta-learning. These guarantees show that $L_p$ stability suffices for robust generalization even when boundedness fails, substantially weakening the standard assumptions in the stability literature.
Empirical Transfer Operators and Finite-Sample Change Detection for Noisy Expanding Interval Maps
We study a finite-sample change-detection problem for one-dimensional noisy dynamical systems using partition-based empirical approximations of stationary behaviour. Given observations from an interval-valued process, we partition the state space into finitely many intervals and estimate a transition matrix from observed transitions between partition elements. After a small Doeblin-type regularisation, the resulting matrix has a unique stationary distribution. This stationary distribution is used as a finite-dimensional approximation of the invariant density, or stationary law, of the observed regime. Using an initial reference segment, we compute a baseline empirical stationary distribution bπ0,ρ. For each subsequent sliding window, we compute a window-based empirical stationary distribution bπt,ρ and define the score St = bπt,ρ bπ0,ρ 1. Large values of St indicate that the stationary behaviour of the observed regime has changed relative to the baseline. The statistic is therefore a detector of changes in stationary behaviour. It is not, by itself, a detector of all possible changes in transition dynamics that preserve the invariant density.
The Sharp Phase Transition of Tyler's M-Estimator for Robust Subspace Recovery
Robust Subspace Recovery (RSR) aims to identify an underlying d-dimensional subspace from a dataset heavily corrupted by outliers. Complexity-theoretic results establish a threshold for the problem's computational hardness based on the dimensionscaled signal-to-noise ratio (DS-SNR): the problem is SSE-hard when the DS-SNR is strictly less than 1, and solvable via practical algorithms when it is greater than 1 under general position assumptions. However, the exact behavior of practical algorithms at the critical boundary DS-SNR = 1 has remained unknown. Specifically, we prove that TME converges exactly to the true subspace for DS-SNR 1 under a new stability condition, which is less restrictive than the general position assumptions used in prior literature. I. Introduction Robust Subspace Recovery (RSR) is a fundamental problem in robust statistics, machine learning, and computer vision. The primary goal of RSR is to identify an underlying low-dimensional linear subspace from a dataset that is heavily corrupted by outliers. The standard formulation of the noiseless RSR problem assumes a dataset X = {xi}Ni=1 RD consisting of n1 inliers lying exactly on a d-dimensional linear subspace L RD, and n0 outliers lying strictly off L . We refer to such a dataset as a noiseless inlier-outlier dataset, where the total number of points is N = n0 +n1. The central algorithmic question in noiseless RSR is under what conditions one can exactly and efficiently recover the underlying d-subspace L . A natural metric for characterizing the difficulty of this problem is the ratio of inliers to outliers, n1/n0, which can be viewed as a signal-to-noise ratio (SNR) [8], [11], [12]. This leads to the dimension-scaled SNR (DS-SNR), denoted by δS: δS:= n1/d n0/(D d) . Hardt and Moitra [5] established a fundamental lower bound, showing that when δS < 1, the noiseless RSR problem is Small Set Expansion (SSE)-hard, a property conjectured to be equivalent to NP-hardness [15]. In the special case of hyperplanes (d = D 1), they showed NP-hardness by invoking a result from [7]. The noiseless RSR problem is SSE-hard if δS < 1.
Gaussian Process Latent Factor Regression for Low-Data, High-Dimensional Output Problems
Stevenson, Edward T., Wolf, Eric T., Mak, Mei Ting, Mayne, N. J., Cranmer, Miles
In the sciences, regression tasks often require predicting high-dimensional outputs from few training examples. Multi-output Gaussian processes excel in low-data regimes but typically struggle with high-dimensional outputs. Compress-then-predict pipelines such as PCA-GP (principal component analysis plus Gaussian process regression) handle high dimensionality, but rely on bases optimized for reconstruction rather than prediction. To address this gap, we propose a model that represents each output as a linear-Gaussian decoding of a low-dimensional latent state drawn from a Gaussian process prior. By analytically marginalizing the decoder weights, we couple compression and prediction in a single objective that scales to high-dimensional outputs. We refer to this model as Gaussian process latent factor regression (GPLFR). We demonstrate GPLFR by building the first spatially resolved emulator of global climate models for rocky exoplanets.
Deep Single-Index Fréchet Regression
Cui, Muqing, Zhou, Yidong, Iao, Su I, Müller, Hans-Georg
Predicting outputs that are located in non-Euclidean spaces, such as probability distributions, networks, and symmetric positive-definite matrices, is becoming increasingly important in modern data analysis, particularly when inputs are high-dimensional. We propose DeSI (Deep Single-Index Fréchet Regression), a semiparametric framework for regression with metric space-valued outputs and multivariate inputs that assumes a single-index structure for the conditional Fréchet mean. DeSI estimates an interpretable index direction, which quantifies the relative importance of inputs, using a deep neural network, and performs Fréchet regression along the resulting one-dimensional index in the target metric space. This structure mitigates the curse of dimensionality while retaining interpretability, which stands in contrast to standard deep neural networks. We establish theoretical guarantees for DeSI, including uniform approximation and convergence rates, and demonstrate its strong predictive performance through simulations on distributions, networks, and symmetric positive-definite matrices, as well as an application to compositional mood data from New Jersey.
Information-Theoretic Bounds for Sparse Covariance Estimation in the Vertical-Split Distributed Model
We study the minimax estimation error for distributed covariance matrix estimation in the vertical-split (feature-split) setting, where two agents each observe different coordinates of $m$ i.i.d. sub-Gaussian samples and communicate a limited number of bits to a central server. While Rahmani et al. [2025] established nearly tight bounds for dense (unstructured) cross-covariance matrices, we investigate whether imposing elementwise $s$-sparsity on the cross-covariance $C_{21}$ can reduce the required communication and sample complexity. In contrast to the horizontal-split setting, where Braverman et al. [2016] showed that sparsity does not reduce communication cost for mean estimation, we prove that sparsity does help for cross-covariance estimation in the vertical split. Specifically, we establish minimax lower bounds showing that the communication budget per agent scales as $B_k = Ω(σ^4 d_k\, s' \log(d_1 d_2/s')/\varepsilon^2)$ and the sample complexity for cross-covariance estimation as $m = Ω(σ^4\, s' \log(d_1 d_2/s')/\varepsilon^2)$, where $s' = s \wedge d_{\min}$. For the $1$-sparse case, this yields an exponential improvement from $d_1 d_2$ to $\log(d_1 d_2)$ compared to the dense rate. Our lower bounds are established via Fano's method with an explicit sparse packing using a Varshamov--Gilbert-type argument for signed partial permutation matrices combined with the Conditional Strong Data Processing Inequality of Rahmani et al. [2025]. We show the bounds are tight with a matching achievable scheme, based on covering-net quantization and entry-wise hard thresholding, that attains the $s$-sparse lower bound up to polylogarithmic factors.