Spectrum-Adaptive Generalization Bounds for Trained Deep Transformers
Sakai, Mana, Imaizumi, Masaaki
Understanding why trained Transformers generalize well is a fundamental problem in modern machine learning theory, and complexity-based generalization bounds provide a principled way to study this question. While existing norm-based bounds for Transformers remove the explicit polynomial dependence on the hidden dimension, they typically impose fixed norm constraints specified a priori and can exhibit unfavorable exponential dependence on depth. In this paper, we derive spectrum-adaptive post hoc generalization bounds for multi-layer Transformers. Under layerwise spectral norm control, the bounds are expressed in terms of layerwise Schatten quantities of the query-key, value, and feedforward weight matrices. Since the Schatten indices need not be fixed a priori and can instead be selected after training, separately for each matrix type and layer, the bounds adaptively trade off spectral complexity against the dimension- and depth-dependent factors according to the learned singular-value profiles. Empirical comparisons of BERT-adapted proxies for the leading complexity factors suggest that the proxies induced by our bounds grow more slowly with depth and hidden dimension than the corresponding norm-based proxies. Overall, our results provide a complexity-based perspective on how the spectral structure of trained Transformers is reflected in generalization analyses.
Robust Tensor Regression with Nonconvexity: Algorithmic and Statistical Theory
Song, Zihao, Liu, Jicai, Lian, Heng, Zhao, Weihua
Tensor regression is an important tool for tensor data analysis, but existing works have not considered the impact of outliers, making them potentially sensitive to such data points. This paper proposes a low tubal rank robust regression method for analyzing high-dimensional tensor data with heavy-tailed random noise. The proposed method is based on a nonconvex relaxation of the tensor tubal rank within a general optimization framework, which allows for nonconvexity in both the loss and penalty functions. We develop an implementable estimation algorithm and establish its global convergence under some mild assumptions. Furthermore, we provide general statistical theory for stationary points, including rates of convergence and bounds on the prediction error. These theoretical results cover many important models, such as linear models, generalized linear models, and Huber regression, and even encompass some nonconvex losses like correntropy and minimum distance criterion-induced losses. Supportive numerical evidence is provided through simulations and application studies.
Statistical Convergence of Spherical First Hitting Diffusion Models
Bienewald, Simon, Trottner, Lukas
Denoising diffusion models have evolved into a state-of-the-art method for tasks in various fields, such as denoising and generation of images, text generation, or generation of synthetic data for training of other machine learning models. First hitting diffusion models (FHDMs) are a particular class of denoising diffusion models with \textit{random} adaptive generation time tailored to generate data on a known manifold. Building on the conditioning framework of Doob's $h$-transform, these models leverage the given information on the target data manifold to demonstrate strong performance across tasks while offering distinct features such as time-homogeneous dynamics of the generating process and a reduced average simulation time. Even though the theoretical investigation of standard forward-backward diffusion models has attracted much attention in the recent past, the statistical convergence properties of FHDMs are not yet understood. In this work, we show that, up to logarithmic factors, FHDMs achieve the minimax optimal convergence rate in total variation for spherically supported Sobolev smooth data distributions. In particular, this is the first statistical optimality result for denoising diffusion modelling with random generation time.
Reliable Chain-of-Thought via Prefix Consistency
Iwase, Naoto, Ichihara, Yuki, Quamar, Mohammad Atif, Komiyama, Junpei
Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces with correct answers reproduce their original answer more often than traces with wrong answers. We use this difference as a reliability signal, prefix consistency, that weights each candidate answer by how often it reappears under regeneration. It requires no access to token log-probabilities or self-rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches Standard MV plateau accuracy at up to 21x fewer tokens (median 4.6x). Our code is available at https://github.com/naoto-iwase/prefix-consistency.
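The reweighting idea described above can be illustrated with a minimal sketch. The abstract does not give the exact weighting rule, so the version below is an assumption: each trace's vote is weighted by the fraction of regenerations (from a truncated prefix) that reproduce its original answer. The function name `prefix_consistency_vote` and the input layout are hypothetical and not taken from the authors' repository.

```python
from collections import defaultdict


def prefix_consistency_vote(candidates):
    """Weighted vote over candidate answers.

    candidates: list of (answer, regenerated_answers) pairs, where
    regenerated_answers are the answers obtained after truncating the
    trace and regenerating the remainder (assumed input format).
    """
    scores = defaultdict(float)
    for answer, regenerated in candidates:
        if regenerated:
            # Assumed weight: fraction of regenerations that agree
            # with the trace's original answer.
            agree = sum(1 for r in regenerated if r == answer)
            weight = agree / len(regenerated)
        else:
            weight = 0.0
        scores[answer] += weight
    winner = max(scores, key=scores.get)
    return winner, dict(scores)


# Toy data: "17" wins a plain majority vote (2 traces vs 1),
# but its traces rarely reproduce their answer under regeneration.
candidates = [
    ("42", ["42", "42", "42"]),
    ("17", ["17", "3", "42"]),
    ("17", ["5", "17", "9"]),
]
winner, scores = prefix_consistency_vote(candidates)
```

In this toy input the weighted vote selects "42" even though unweighted majority voting would select "17"; whether the paper uses this exact weighting is not stated in the abstract.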
TopoFisher: Learning Topological Summary Statistics by Maximizing Fisher Information
Biagetti, Matteo, Carrière, Mathieu, Conti, Francesco, Ferrari, Enrico Maria, Heydenreich, Sven, Viswanathan, Karthik
Persistence diagrams provide stable, interpretable summaries of geometric and topological structure and are useful for simulation-based inference when low-order statistics miss key information. Yet persistence-based pipelines require hand-chosen filtrations, vectorizations, and compressors, typically without an objective tied to parameter uncertainty. We introduce \textbf{TopoFisher}, a differentiable persistent-homology pipeline that learns topological summaries by maximizing local Gaussian Fisher information. Using simulations near a fiducial parameter, TopoFisher optimizes trainable filtrations, diagram vectorizations, and compressors without posterior samples or supervised regression targets, while retaining a stable topological inductive bias. We also give sufficient regularity conditions for the log-determinant Fisher loss to be locally Lipschitz in trainable parameters. Controlled experiments on noisy spirals and Gaussian random fields, where total Fisher information is known, show that TopoFisher recovers much of the available information and outperforms fixed topological vectorizations. Our main results are on weak gravitational lensing, a high-dimensional non-Gaussian cosmological field-inference problem. Learned topological summaries reach higher Fisher information than state-of-the-art cosmological summaries and approach an unconstrained Information Maximising Neural Network baseline with up to $\sim80\times$ fewer parameters. The learned filtrations also generalize better: under simulator shift from lognormal to LPT-based maps they retain most Fisher information, while the neural baseline drops, and in neural posterior estimation they give tighter constraints than both the neural baseline and state-of-the-art cosmological summaries. These results support Fisher-based topological optimization as a robust, parameter-efficient front end for simulation-based inference.
Flow Matching for Count Data
High-dimensional count data arise in applications such as single-cell RNA sequencing and neural spike trains, where mappings between distributions across successive batches or time points form critical components of data analysis. The recent success of diffusion- and flow-based deep generative models for images, video, and text motivates extending these ideas to count-valued settings, but many existing methods either treat each count as a categorical state or transform counts into a continuous space, neither of which is natural or efficient when the count range is large. We propose count-FM, a flow-matching framework for count data based on a continuous-time birth-death process with local unit jumps. Count-FM learns marginal transitions efficiently in count space through simulation-free training of conditional transition rates, allowing transport between arbitrary count-distributed source and target populations. In simulation, count-FM achieves better sample quality than representative baselines while using substantially fewer parameters. We further apply count-FM to scRNA-seq and neural spike-train data for unconditional generation, transport, and conditional generation. Across these tasks, count-FM yields improved sample quality, greater modeling efficiency, and interpretable transport paths.
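To make the notion of a continuous-time birth-death process with local unit jumps concrete, the sketch below simulates one path with the standard Gillespie procedure. It is not count-FM: the rates here are fixed, hand-written functions of the current count only (an assumption for illustration), whereas the abstract describes learned conditional transition rates. The names `simulate_birth_death`, `birth_rate`, and `death_rate` are hypothetical.

```python
import random


def simulate_birth_death(n0, birth_rate, death_rate, t_end, rng):
    """Simulate a birth-death path on the non-negative integers.

    birth_rate(n) and death_rate(n) are assumed to depend only on the
    current count n, so exponential waiting times are exact.
    Returns a list of (time, count) pairs, starting at (0.0, n0).
    """
    t, n = 0.0, n0
    path = [(t, n)]
    while True:
        b = birth_rate(n)
        # No deaths from zero, so counts stay non-negative.
        d = death_rate(n) if n > 0 else 0.0
        total = b + d
        if total <= 0.0:
            break  # absorbing state: no further jumps
        t += rng.expovariate(total)  # waiting time to the next jump
        if t >= t_end:
            break
        # Unit jump: +1 with probability b / total, otherwise -1.
        n += 1 if rng.random() < b / total else -1
        path.append((t, n))
    return path


# Illustrative rates: constant births, deaths proportional to the count.
rng = random.Random(0)
path = simulate_birth_death(
    n0=5,
    birth_rate=lambda n: 2.0,
    death_rate=lambda n: 0.5 * n,
    t_end=10.0,
    rng=rng,
)
```

Every transition changes the count by exactly one, which is the "local unit jump" structure; the state space is the integers themselves rather than a categorical encoding or a continuous relaxation.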
Black-box model classification under the discriminative factorization
Helm, Hayden, Ohata, Merrick, Priebe, Carey
Access to modern generative systems is often restricted to querying an API (the ``black-box'' setting), and many properties of the system are unknown to the user at inference time. While recent work has shown that low-dimensional representations of models based on the relationship between their embedded responses to a set of queries are useful for inferring model-level properties, the quality of these representations is highly sensitive to the query set. We introduce the \emph{discriminative factorization} to distinguish between high- and low-quality query sets in the context of black-box model-level classification. Under this framework, the probability of chance-level classification decays exponentially in the query budget. On three auditing tasks, estimated factorization parameters predict the empirical performance decay rate. We conclude by showing that query sets selected using the estimated discriminative field reproduce the empirical ordering of oracle query sets.
Characterizing and Correcting Effective Target Shift in Online Learning
Online learning from a stream of data is a defining feature of intelligence, yet modern machine learning systems often struggle in this setting, especially under distributional shift. To understand its basic properties, we study the relationship between online and offline learning in the context of kernel regression. We derive a closed-form expression for the function learned by online kernel regression, revealing that online kernel regression is equivalent to offline regression with shifted, inaccurate target outputs. Conversely, we show that by compensating for this effective shift in the teaching signal through target correction, online kernel-based learning can provably learn the same predictor as its offline counterpart. We derive both a closed-form expression for this target correction and an iterative form that can be applied sequentially. Applying this framework to image classification tasks on CIFAR-10 and CORe50, we show that online stochastic gradient descent with iteratively corrected targets outperforms learning with the true targets in continual learning settings. This work therefore provides a basic framework for analyzing and improving online learning in non-stationary environments.
Consistency Regularised Gradient Flows for Inverse Problems
Spagnoletti, Alessio, Wang, Tim Y. J., Pereyra, Marcelo, Akyildiz, O. Deniz
Vision-Language Latent Diffusion Models (LDMs) (Rombach et al., 2022) provide powerful generative priors for inverse problems. However, existing LDM-based inverse solvers typically require a large number of neural function evaluations (NFEs) and backpropagation through large pretrained components, leading to substantial computational costs and, in some cases, degraded reconstruction quality. We propose a unified Euclidean-Wasserstein-2 gradient-flow framework that jointly performs posterior sampling and prompt optimization in the latent space through a single flow that aligns the prior and posterior with the observed data. Combined with few-step latent text-to-image models, this formulation enables low-NFE inference without backpropagation through autoencoders. Experiments across several canonical imaging inverse problems show that our method achieves state-of-the-art performance with significantly reduced computational cost.
It Just Takes Two: Scaling Amortized Inference to Large Sets
Wehenkel, Antoine, Kagan, Michael, Heinrich, Lukas, Pollard, Chris
Neural posterior estimation has emerged as a powerful tool for amortized inference, with growing adoption across scientific and applied domains. In many of these applications, the conditioning variable is a set of observations whose elements depend not only on the target but also on unknown factors shared across the set. Optimal inference therefore requires treating the set jointly, which in turn requires training the estimator at the deployment set size -- a regime where memory and compute quickly become prohibitive. We introduce a simple, theoretically grounded strategy that decouples representation learning from posterior modeling. Our method trains a mean-pool Deep Set on sets of size at most two, producing an encoder that generalizes to arbitrary set sizes. The inference head is then finetuned on pre-aggregated embeddings, making training cost essentially independent of the deployment set size N. Across scalar, image, multi-view 3D, molecular, and high-dimensional conditional generation benchmarks with N in the thousands, our approach matches or outperforms standard baselines at a fraction of the compute.
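The reason mean pooling lets an inference head work on pre-aggregated embeddings can be shown with a small sketch. The per-element encoder `encode` below is a hypothetical stand-in (a fixed feature map, not a trained network), and `mean_pool` is an illustrative name; the point is only that the pooled embedding has the same dimension for a set of size 2 and a set of size 1000, and can be accumulated one element at a time.

```python
def encode(x):
    """Hypothetical per-element encoder; a trained network in practice."""
    return [x, x * x]


def mean_pool(elements):
    """Mean of per-element embeddings, accumulated in a streaming fashion."""
    total = None
    count = 0
    for x in elements:
        e = encode(x)
        if total is None:
            total = [0.0] * len(e)
        total = [t + v for t, v in zip(total, e)]
        count += 1
    if count == 0:
        raise ValueError("cannot pool an empty set")
    return [t / count for t in total]


small = mean_pool([1.0, 3.0])                      # set of size 2
large = mean_pool(float(i) for i in range(1000))   # set of size 1000
```

Because the pooled vector has a fixed size, embeddings of large sets can be computed once and cached, so anything trained on top of them sees inputs whose size does not depend on N.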