

Spectrum-Adaptive Generalization Bounds for Trained Deep Transformers

arXiv.org Machine Learning

Understanding why trained Transformers generalize well is a fundamental problem in modern machine learning theory, and complexity-based generalization bounds provide a principled way to study this question. While existing norm-based bounds for Transformers remove the explicit polynomial dependence on the hidden dimension, they typically impose fixed norm constraints specified a priori and can exhibit unfavorable exponential dependence on depth. In this paper, we derive spectrum-adaptive post hoc generalization bounds for multi-layer Transformers. Under layerwise spectral norm control, the bounds are expressed in terms of layerwise Schatten quantities of the query-key, value, and feedforward weight matrices. Since the Schatten indices need not be fixed a priori and can instead be selected after training, separately for each matrix type and layer, the bounds adaptively trade off spectral complexity against the dimension- and depth-dependent factors according to the learned singular-value profiles. Empirical comparisons of BERT-adapted proxies for the leading complexity factors suggest that the proxies induced by our bounds grow more slowly with depth and hidden dimension than the corresponding norm-based proxies. Overall, our results provide a complexity-based perspective on how the spectral structure of trained Transformers is reflected in generalization analyses.
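As a rough illustration of the layerwise Schatten quantities mentioned above, the sketch below computes the Schatten $p$-norm of a single weight matrix from its singular values. The function name `schatten_norm` is hypothetical. The code does not reproduce the paper's bound or its rule for choosing the index after training. It only shows how the same matrix yields different complexity values as $p$ varies.

```python
import numpy as np

def schatten_norm(W, p):
    """Schatten p-norm of W: the l_p norm of its singular values.

    p = 1 gives the nuclear norm, p = 2 the Frobenius norm,
    and p = inf the spectral norm.
    """
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    if np.isinf(p):
        return float(s.max())
    return float((s ** p).sum() ** (1.0 / p))

def schatten_profile(W, indices):
    """Evaluate several candidate indices on one matrix (illustrative only)."""
    return {p: schatten_norm(W, p) for p in indices}
```

For a matrix whose singular values decay quickly, the values returned by `schatten_profile` for small $p$ stay close to the spectral norm, which is the kind of spectral structure a post hoc choice of index could exploit.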


Strategic Classification under Unknown Personalized Manipulation

Neural Information Processing Systems

We study the fundamental mistake bound and sample complexity in strategic classification, where agents can strategically manipulate their feature vector up to an extent in order to be predicted as positive. For example, given a classifier determining college admission, student candidates may try to take easier classes to improve their GPA, retake the SAT, and change schools in an effort to fool the classifier. Ball manipulations are a widely studied class of manipulations in the literature, where agents can modify their feature vector within a bounded-radius ball. Unlike most prior work, our work considers manipulations to be personalized, meaning that agents can have different levels of manipulation ability (e.g., varying radii for ball manipulations), and unknown to the learner. We formalize the learning problem in an interaction model where the learner first deploys a classifier and the agent manipulates the feature vector within their manipulation set to game the deployed classifier. We investigate various scenarios in terms of the information available to the learner during the interaction, such as observing the original feature vector before or after deployment, observing the manipulated feature vector, or not seeing either the original or the manipulated feature vector. We begin by providing online mistake bounds and PAC sample complexity in these scenarios for ball manipulations. We also explore non-ball manipulations and show that, even in the simplest scenario where both the original and the manipulated feature vectors are revealed, the mistake bounds and sample complexity are lower bounded by Ω(|H|) when the target function belongs to a known class H.
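The interaction model with personalized ball manipulations can be sketched for one concrete case. The example below assumes a linear classifier $h(x) = 1[w \cdot x + b \ge 0]$ and Euclidean balls, neither of which is specified in the abstract. The function names, and the tie-breaking rule that an agent manipulates whenever the boundary is within reach, are likewise assumptions made for illustration.

```python
import numpy as np

def classify(x, w, b):
    """Assumed linear classifier: positive iff w.x + b >= 0."""
    return float(w @ x + b) >= 0.0

def best_response(x, r, w, b):
    """Agent with personal radius r reacts to the deployed classifier.

    The agent keeps x if it is already classified positive, moves to the
    closest point on the decision boundary if that point lies within
    Euclidean distance r, and otherwise stays put.
    """
    score = float(w @ x + b)
    if score >= 0.0:
        return x.copy()
    norm_w = np.linalg.norm(w)
    dist = -score / norm_w          # distance from x to the boundary
    if dist <= r:
        return x + dist * w / norm_w
    return x.copy()
```

Two agents with the same original features but different radii can therefore receive different predictions, which is what makes unknown personalized radii harder for the learner than a single known radius.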


Separating Geometry from Probability in the Analysis of Generalization

arXiv.org Machine Learning

The goal of machine learning is to find models that minimize prediction error on data that has not yet been seen. Its operational paradigm assumes access to a dataset $S$ and articulates a scheme for evaluating how well a given model performs on an arbitrary sample. The sample can be $S$ (in which case we speak of "in-sample" performance) or some entirely new $S'$ (in which case we speak of "out-of-sample" performance). Traditional analysis of generalization assumes that both in- and out-of-sample data are i.i.d. draws from an infinite population. However, these probabilistic assumptions cannot be verified even in principle. This paper presents an alternative view of generalization through the lens of sensitivity analysis of solutions of optimization problems to perturbations in the problem data. Under this framework, generalization bounds are obtained by purely deterministic means and take the form of variational principles that relate in-sample and out-of-sample evaluations through an error term that quantifies how close out-of-sample data are to in-sample data. Statistical assumptions can then be used ex post to characterize the situations when this error term is small (either on average or with high probability).


The Rules-and-Facts Model for Simultaneous Generalization and Memorization in Neural Networks

arXiv.org Machine Learning

A key capability of modern neural networks is their capacity to simultaneously learn underlying rules and memorize specific facts or exceptions. Yet, theoretical understanding of this dual capability remains limited. We introduce the Rules-and-Facts (RAF) model, a minimal solvable setting that enables precise characterization of this phenomenon by bridging two classical lines of work in the statistical physics of learning: the teacher-student framework for generalization and Gardner-style capacity analysis for memorization. In the RAF model, a fraction $1 - \varepsilon$ of training labels is generated by a structured teacher rule, while a fraction $\varepsilon$ consists of unstructured facts with random labels. We characterize when the learner can simultaneously recover the underlying rule - allowing generalization to new data - and memorize the unstructured examples. Our results quantify how overparameterization enables the simultaneous realization of these two objectives: sufficient excess capacity supports memorization, while regularization and the choice of kernel or nonlinearity control the allocation of capacity between rule learning and memorization. The RAF model provides a theoretical foundation for understanding how modern neural networks can infer structure while storing rare or non-compressible information.
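The data-generating process described above can be sketched as follows. The abstract does not specify the teacher or the input distribution, so this example assumes a linear-threshold teacher $y = \mathrm{sign}(w^\star \cdot x)$ with Gaussian inputs. The function name `sample_raf` and all parameter names are hypothetical.

```python
import numpy as np

def sample_raf(n, d, eps, rng):
    """Draw n labelled points: a fraction eps are 'facts', the rest 'rules'.

    Rule labels come from an assumed linear-threshold teacher w_star.
    Fact labels are independent random signs.
    """
    X = rng.standard_normal((n, d))
    w_star = rng.standard_normal(d)
    y = np.where(X @ w_star >= 0.0, 1.0, -1.0)    # teacher (rule) labels

    n_facts = int(round(eps * n))
    fact_idx = rng.choice(n, size=n_facts, replace=False)
    is_fact = np.zeros(n, dtype=bool)
    is_fact[fact_idx] = True
    # Overwrite the selected labels with unstructured random signs.
    y[is_fact] = rng.choice([-1.0, 1.0], size=n_facts)
    return X, y, is_fact, w_star
```

A learner that fits every label in such a sample has memorized the facts, while its agreement with `w_star` on fresh inputs measures how well it recovered the rule.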




74dbd1111727a31a2b825d615d80b2e7-Supplemental.pdf

Neural Information Processing Systems

Recent empirical successes in large-scale machine learning have been powered by massive data parallelism and hardware acceleration, with batch sizes trending beyond 10K+ images [46] or 1M+ tokens [9]. Numerous interdisciplinary sources [5, 12, 24, 33] indicate that the performance bottlenecks of contemporary deep learning pipelines can lie in many places other than gradient computation.



HiPPO-Prophecy: State-Space Models can Provably Learn Dynamical Systems in Context

arXiv.org Machine Learning

This work explores the in-context learning capabilities of State Space Models (SSMs) and presents, to the best of our knowledge, the first theoretical explanation of a possible underlying mechanism. We introduce a novel weight construction for SSMs, enabling them to predict the next state of any dynamical system after observing previous states without parameter fine-tuning. This is accomplished by extending the HiPPO framework to demonstrate that continuous SSMs can approximate the derivative of any input signal. Specifically, we find an explicit weight construction for continuous SSMs and provide an asymptotic error bound on the derivative approximation. The discretization of this continuous SSM subsequently yields a discrete SSM that predicts the next state. Finally, we demonstrate the effectiveness of our parameterization empirically. This work should be an initial step toward understanding how sequence models based on SSMs learn in context.
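The link between a derivative estimate and next-state prediction can be illustrated with a much simpler stand-in than the paper's construction. The sketch below replaces the HiPPO-based SSM with a backward finite difference and an explicit Euler step, $x(t + \Delta) \approx x(t) + \Delta \, \dot{x}(t)$. It is not the weight construction from the paper, and the function name `predict_next_state` is hypothetical.

```python
import numpy as np

def predict_next_state(states, dt):
    """Predict the next state from the observed states, with no fitted parameters.

    states: array of shape (T,) or (T, d) with T >= 2, sampled every dt.
    The derivative is estimated by a backward finite difference and the
    prediction is one explicit Euler step from the last observed state.
    """
    states = np.asarray(states, dtype=float)
    deriv = (states[-1] - states[-2]) / dt
    return states[-1] + dt * deriv
```

This stand-in is exact only for signals that are linear in time. The abstract's claim concerns how an SSM can supply the derivative estimate itself, with an asymptotic error bound, rather than relying on a two-point difference.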