Computational Learning Theory
Towards Problem-dependent Optimal Learning Rates
We study problem-dependent rates, i.e., generalization errors that scale tightly with the variance or the effective loss at the "best hypothesis." Existing uniform convergence and localization frameworks, the most widely used tools to study this problem, often fail to simultaneously provide parameter localization and optimal dependence on the sample size. As a result, existing problem-dependent rates are often rather weak when the hypothesis class is "rich" and the worst-case bound of the loss is large. In this paper we propose a new framework based on a "uniform localized convergence" principle. We provide the first (moment-penalized) estimator that achieves the optimal variance-dependent rate for general "rich" classes; we also establish improved loss-dependent rates for standard empirical risk minimization.
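To fix ideas, here is a rough template of such a bound (our notation, not taken from the paper): a variance-dependent rate replaces the worst-case loss bound $B$ in the leading term of a uniform-convergence bound by the variance at the best hypothesis,

$$\mathbb{E}\,\ell(\hat{h}) - \mathbb{E}\,\ell(h^*) \;\lesssim\; \sqrt{\frac{\mathcal{V}^*\,\mathrm{comp}_n(\mathcal{F})}{n}} \;+\; \frac{B\,\mathrm{comp}_n(\mathcal{F})}{n}, \qquad \mathcal{V}^* = \mathrm{Var}\big(\ell(h^*)\big),$$

where $\mathrm{comp}_n(\mathcal{F})$ is some complexity measure of the class. The difficulty the abstract points to is obtaining the $\sqrt{\mathcal{V}^*/n}$ leading term with optimal dependence on $n$ when $\mathcal{F}$ is rich and $B$ is large.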
Spherical dimension
Chornomaz, Bogdan, Moran, Shay, Waknine, Tom
We introduce and study the spherical dimension, a natural topological relaxation of the VC dimension that unifies several results in learning theory where topology plays a key role in the proofs. The spherical dimension is defined by extending the set of realizable datasets (used to define the VC dimension) to the continuous space of realizable distributions. In this space, a shattered set of size d (in the VC sense) is completed into a continuous object, specifically a d-dimensional sphere of realizable distributions. The spherical dimension is then defined as the dimension of the largest sphere in this space. Thus, the spherical dimension is at least the VC dimension. The spherical dimension serves as a common foundation for leveraging the Borsuk-Ulam theorem and related topological tools. We demonstrate the utility of the spherical dimension in diverse applications, including disambiguations of partial concept classes, reductions from classification to stochastic convex optimization, stability and replicability, and sample compression schemes. Perhaps surprisingly, we show that the open question posed by Alon, Hanneke, Holzman, and Moran (FOCS 2021) of whether there exist non-trivial disambiguations for halfspaces with margin is equivalent to the basic open question of whether the VC and spherical dimensions are finite together.
Numerical and statistical analysis of NeuralODE with Runge-Kutta time integration
Ehrhardt, Emily C., Gottschalk, Hanno, Riedlinger, Tobias J.
NeuralODE is an example of generative machine learning based on the push-forward of a simple source measure under a bijective mapping, which in the case of NeuralODE is given by the flow of an ordinary differential equation. Using Liouville's formula, the log-density of the push-forward measure is easy to compute, and thus NeuralODE can be trained by maximum likelihood so that the Kullback-Leibler divergence between the push-forward through the flow map and the target measure generating the data becomes small. In this work, we give a detailed account of the consistency of maximum-likelihood-based empirical risk minimization for a generic class of target measures. In contrast to prior work, we not only consider the statistical learning theory but also give a detailed numerical analysis of the NeuralODE algorithm based on 2nd-order Runge-Kutta (RK) time integration. Using the universal approximation theory for deep ReQU networks, stability and convergence rates for the RK scheme, as well as metric entropy and concentration inequalities, we are able to prove that NeuralODE is a probably approximately correct (PAC) learning algorithm.
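The training pipeline described above is concrete enough to sketch. The following is a minimal, hypothetical PyTorch rendering (not the authors' code): a learned vector field is integrated with Heun's 2nd-order Runge-Kutta scheme, Liouville's formula supplies the log-density of the push-forward, and training minimizes the empirical negative log-likelihood. The network sizes, step counts, and Gaussian source measure are illustrative choices.

```python
# Hypothetical sketch of maximum-likelihood NeuralODE training with a
# 2nd-order Runge-Kutta (Heun) integrator; not the authors' implementation.
import math
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Time-dependent vector field f(z, t) defining dz/dt = f(z, t)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, dim))
    def forward(self, z, t):
        return self.net(torch.cat([z, t.expand(z.shape[0], 1)], dim=1))

def field_and_div(f, z, t):
    """Evaluate f(z, t) and the exact trace of its Jacobian in z
    (the divergence appearing in Liouville's formula)."""
    out = f(z, t)
    div = torch.zeros(z.shape[0])
    for i in range(z.shape[1]):  # one autograd pass per coordinate
        div = div + torch.autograd.grad(out[:, i].sum(), z, create_graph=True)[0][:, i]
    return out, div

def log_likelihood(f, x, n_steps=20):
    """log p_model(x): integrate the flow backward from t=1 to t=0 with Heun/RK2,
    accumulating -int div f dt, then add the log-density of the Gaussian source."""
    z = x.clone().requires_grad_(True)
    logdet = torch.zeros(x.shape[0])
    h, t = -1.0 / n_steps, torch.ones(1)
    for _ in range(n_steps):
        k1, d1 = field_and_div(f, z, t)
        z_mid = z + h * k1
        k2, d2 = field_and_div(f, z_mid, t + h)
        logdet = logdet + 0.5 * h * (d1 + d2)  # backward quadrature of the divergence
        z = z + 0.5 * h * (k1 + k2)
        t = t + h
    log_p0 = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.shape[1] * math.log(2 * math.pi)
    return log_p0 + logdet

# Empirical risk minimization: minimize the negative log-likelihood of the data.
field = VectorField(dim=2)
opt = torch.optim.Adam(field.parameters(), lr=1e-3)
data = torch.randn(256, 2) @ torch.tensor([[1.0, 0.8], [0.0, 0.6]])  # toy target
for _ in range(100):
    opt.zero_grad()
    loss = -log_likelihood(field, data).mean()
    loss.backward()
    opt.step()
```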
Revisiting Agnostic Boosting
da Cunha, Arthur, Høgsgaard, Mikael Møller, Paudice, Andrea, Sun, Yuxin
Boosting is a key method in statistical learning, allowing weak learners to be converted into strong ones. While well studied in the realizable case, the statistical properties of weak-to-strong learning remain less understood in the agnostic setting, where there are no assumptions on the distribution of the labels. In this work, we propose a new agnostic boosting algorithm with substantially improved sample complexity compared to prior works under very general assumptions. Our approach is based on a reduction to the realizable case, followed by a margin-based filtering step to select high-quality hypotheses. We conjecture that the error rate achieved by our proposed method is optimal up to logarithmic factors.
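The margin-based filtering step admits a simple illustration. Below is a hedged Python sketch (our construction; the paper's actual filter and reduction are more involved): given a pool of candidate hypotheses, e.g. outputs of a realizable booster run on subsamples, keep those whose empirical correlation with the labels on held-out data clears a margin threshold, and return the best survivor. The threshold `theta` and the hypothesis representation are illustrative.

```python
import numpy as np

def margin_filter(hypotheses, X_val, y_val, theta=0.1):
    """Keep hypotheses h whose empirical margin mean(y * h(x)) on held-out data
    exceeds theta; among survivors, return the one with smallest error."""
    survivors = []
    for h in hypotheses:
        preds = np.array([h(x) for x in X_val])  # predictions in {-1, +1}
        margin = float(np.mean(y_val * preds))   # correlation with labels
        if margin >= theta:
            survivors.append((h, float(np.mean(preds != y_val))))
    if not survivors:
        raise ValueError("no hypothesis cleared the margin threshold")
    return min(survivors, key=lambda pair: pair[1])[0]

# Toy usage: candidate 1-D threshold classifiers on noisy data.
rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = np.where(X > 0.3, 1, -1)
pool = [lambda x, c=c: 1 if x > c else -1 for c in np.linspace(-1, 1, 21)]
best = margin_filter(pool, X, y, theta=0.2)
```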
Quantifying Overfitting along the Regularization Path for Two-Part-Code MDL in Supervised Classification
We provide a complete characterization of the entire regularization curve of a modified two-part-code Minimum Description Length (MDL) learning rule for binary classification, based on an arbitrary prior or description language. Grunwald and Langford [2004] previously established the lack of asymptotic consistency, from an agnostic PAC (frequentist worst case) perspective, of the MDL rule with a penalty parameter of $\lambda=1$, suggesting that it under-regularizes. Driven by interest in understanding how benign or catastrophic under-regularization and overfitting might be, we obtain a precise quantitative description of the worst case limiting error as a function of the regularization parameter $\lambda$ and noise level (or approximation error), significantly tightening the analysis of Grunwald and Langford for $\lambda=1$ and extending it to all other choices of $\lambda$.
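As a rough rendering of the rule in question (our notation; the paper's precise setup may differ): a two-part code first describes a hypothesis $h$ using $|c(h)|$ bits under the chosen prior or description language $c$, then describes the $n$ observed labels given $h$, so the $\lambda$-penalized rule takes the form

$$\hat{h}_\lambda \in \arg\min_{h} \; \lambda\,|c(h)| \;+\; \mathrm{bits}(y_{1:n} \mid h, x_{1:n}),$$

with $\lambda=1$ recovering the plain two-part-code MDL rule analyzed by Grunwald and Langford, and larger $\lambda$ regularizing more aggressively; the paper characterizes the worst-case limiting error along this entire curve in $\lambda$.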
Fair Text Classification via Transferable Representations
Leteno, Thibaud, Perrot, Michael, Laclau, Charlotte, Gourru, Antoine, Gravier, Christophe
Group fairness is a central research topic in text classification, where achieving fair treatment between sensitive groups (e.g., women and men) remains an open challenge. We propose an approach that extends the use of the Wasserstein Dependency Measure for learning unbiased neural text classifiers. Given the challenge of distinguishing fair from unfair information in a text encoder, we draw inspiration from adversarial training by inducing independence between representations learned for the target label and those for a sensitive attribute. We further show that Domain Adaptation can be efficiently leveraged to remove the need for access to the sensitive attributes in the dataset we cure. We provide both theoretical and empirical evidence that our approach is well-founded.
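To make the independence-inducing penalty concrete, here is a minimal hypothetical sketch (our construction, not the authors' code) of estimating a Wasserstein dependency penalty between a representation $Z$ and a binary sensitive attribute $S$ via the Kantorovich dual, with the product of marginals simulated by shuffling $S$ within the batch; the critic architecture and the Lipschitz-enforcement scheme are illustrative choices.

```python
# Hypothetical sketch: Wasserstein dependency penalty WDM(Z; S) = W1(P_{Z,S}, P_Z x P_S),
# estimated with a critic in the Kantorovich dual form. Not the authors' code.
import torch
import torch.nn as nn

dim = 32
critic = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, 1))

def wdm_estimate(z, s):
    """Dual estimate: E_joint[critic] - E_product[critic], where the product of
    marginals is approximated by shuffling s across the batch."""
    joint = critic(torch.cat([z, s], dim=1)).mean()
    s_shuffled = s[torch.randperm(s.shape[0])]
    product = critic(torch.cat([z, s_shuffled], dim=1)).mean()
    return joint - product

# Toy usage: a batch of representations and 0/1 sensitive attributes.
z = torch.randn(128, dim)
s = torch.randint(0, 2, (128, 1)).float()
penalty = wdm_estimate(z, s)

# Training would alternate: (a) ascend wdm_estimate over the critic, keeping it
# Lipschitz (e.g. by weight clipping), and (b) descend
#     task_loss + lam * wdm_estimate(encoder(x), s)
# over the encoder/classifier, pushing Z toward independence from S.
```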
A Theory of Learning with Autoregressive Chain of Thought
Joshi, Nirmit, Vardi, Gal, Block, Adam, Goel, Surbhi, Li, Zhiyuan, Misiakiewicz, Theodor, Srebro, Nathan
For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed, time-invariant generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes such as linear thresholds. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.
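A toy rendering of this setup (our illustration, not the paper's construction): a fixed, time-invariant next-token generator from a base class, here a linear threshold over the last k tokens, is iterated T steps, and the final token is taken as the answer.

```python
# Toy chain-of-thought generation by iterating a fixed, time-invariant
# next-token generator from a linear-threshold base class.
import numpy as np

def linear_threshold_generator(w, b):
    """Base-class element: next token = sign of a linear function of the context window."""
    def step(context):  # context: np.array of +/-1 tokens
        return 1 if w @ context + b > 0 else -1
    return step

def chain_of_thought(prompt, step, T, k):
    """Iterate the generator T times on a sliding window of the last k tokens;
    the answer is the final token of the generated chain."""
    seq = list(prompt)
    for _ in range(T):
        seq.append(step(np.array(seq[-k:])))
    return seq[-1]

# Example: a 3-token majority-vote generator iterated for 5 steps.
gen = linear_threshold_generator(np.ones(3), 0.0)
print(chain_of_thought([1, -1, 1], gen, T=5, k=3))
```

Note that the same `step` function is applied at every position, which is the time invariance the abstract identifies as the source of length-independent sample complexity.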
The Computational Complexity of Positive Non-Clashing Teaching in Graphs
Ganian, Robert, Khazaliya, Liana, Mc Inerney, Fionn, Rocton, Mathis
We study the classical and parameterized complexity of computing the positive non-clashing teaching dimension of a set of concepts, that is, the smallest number of examples per concept required to successfully teach an intelligent learner under the considered, previously established model. For any class of concepts, it is known that this problem can be effortlessly transferred to the setting of balls in a graph G. We establish (1) the NP-hardness of the problem even when restricted to instances with positive non-clashing teaching dimension k=2 and where all balls in the graph are present, (2) near-tight running time upper and lower bounds for the problem on general graphs, (3) fixed-parameter tractability when parameterized by the vertex integrity of G, and (4) a lower bound excluding fixed-parameter tractability when parameterized by the feedback vertex number and pathwidth of G, even when combined with k. Our results provide a nearly complete understanding of the complexity landscape of computing the positive non-clashing teaching dimension and answer open questions from the literature.
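Since the problem reduces to the setting of balls in a graph G, the concept class in question is easy to materialize. Below is a hedged Python sketch (ours, for illustration only) that enumerates all balls $B(v, r) = \{u : \mathrm{dist}(u, v) \le r\}$ of a graph via breadth-first search; these balls are the concepts whose positive non-clashing teaching dimension the paper studies.

```python
# Illustrative sketch: enumerate the balls of a graph, the concept class to
# which positive non-clashing teaching is transferred.
from collections import deque

def balls(adj):
    """adj: dict mapping each vertex to its neighbors. Returns all balls
    {u : dist(u, v) <= r} as frozensets, over all centers v and radii r."""
    out = set()
    for v in adj:
        dist = {v: 0}                     # BFS distances from center v
        q = deque([v])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        for r in range(max(dist.values()) + 1):
            out.add(frozenset(u for u, d in dist.items() if d <= r))
    return out

# Example: a path on 4 vertices.
print(sorted(balls({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}), key=len))
```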
The Structural Complexity of Matrix-Vector Multiplication
Anand, Emile, Brand, Jan van den, McCarty, Rose
We consider the problem of preprocessing an $n\times n$ matrix M, and supporting queries that, for any vector v, return the matrix-vector product Mv. This problem has been extensively studied in both theory and practice: on one side, practitioners have developed algorithms that are highly efficient in practice, whereas theoreticians have proven that the problem cannot be solved faster than naive multiplication in the worst case. This lower bound holds even in the average case, implying that existing average-case analyses cannot explain this gap between theory and practice. Therefore, we study the problem for structured matrices. We show that for $n\times n$ matrices of VC-dimension d, the matrix-vector multiplication problem can be solved with $\tilde{O}(n^2)$ preprocessing and $\tilde O(n^{2-1/d})$ query time. Given the low constant VC-dimensions observed in most real-world data, our results offer an explanation for why the problem can be solved so much faster in practice. Moreover, our bounds hold even if the matrix does not itself have low VC-dimension, but is obtained by (possibly adversarially) corrupting at most a subquadratic number of entries of an unknown low VC-dimension matrix. Our results yield the first non-trivial upper bounds for many applications. In previous works, the online matrix-vector hypothesis (conjecturing that quadratic time is needed per query) was used to prove many conditional lower bounds, showing that it is impossible to compute and maintain high-accuracy estimates for shortest paths, Laplacian solvers, effective resistance, and triangle detection in graphs subject to node insertions and deletions in subquadratic time. Yet, via a reduction to our matrix-vector multiplication result, we show that these quantities can be maintained efficiently when the input is structured, providing the first subquadratic upper bounds in the high-accuracy regime.
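To make the improvement concrete (arithmetic ours): for a matrix of constant VC-dimension $d$, each query costs

$$\tilde{O}\big(n^{2-1/d}\big), \qquad \text{e.g. } \tilde{O}(n^{1.5}) \text{ for } d = 2,$$

after $\tilde{O}(n^2)$ preprocessing, versus the $\Theta(n^2)$ per query of naive multiplication in the worst case.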
Swap Regret and Correlated Equilibria Beyond Normal-Form Games
Arunachaleswaran, Eshwar Ram, Collina, Natalie, Mansour, Yishay, Mohri, Mehryar, Schneider, Jon, Sivan, Balasubramanian
Swap regret is a notion that has proven itself to be central to the study of general-sum normal-form games, with swap-regret minimization leading to convergence to the set of correlated equilibria and guaranteeing non-manipulability against a self-interested opponent. However, the situation for more general classes of games -- such as Bayesian games and extensive-form games -- is less clear-cut, with multiple candidate definitions of swap regret but no known efficiently minimizable variant that implies analogous non-manipulability guarantees. In this paper, we present a new variant of swap regret for polytope games that we call "profile swap regret", with the property that obtaining sublinear profile swap regret is both necessary and sufficient for any learning algorithm to be non-manipulable by an opponent (resolving an open problem of Mansour et al., 2022). Although we show profile swap regret is NP-hard to compute given a transcript of play, we show it is nonetheless possible to design efficient learning algorithms that guarantee at most $O(\sqrt{T})$ profile swap regret. Finally, we explore the correlated equilibrium notion induced by low-profile-swap-regret play, and demonstrate a gap between the set of outcomes that can be implemented by this learning process and the set of outcomes that can be implemented by a third-party mediator (in contrast to the situation in normal-form games).
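For context, the standard normal-form definition being generalized (not the paper's new variant): a learner playing actions $a_t$ from action set $A$ against opponent actions $b_t$ incurs swap regret

$$\mathrm{SwapReg}_T \;=\; \max_{\phi : A \to A} \sum_{t=1}^{T} \Big( u\big(\phi(a_t), b_t\big) - u\big(a_t, b_t\big) \Big),$$

and sublinear swap regret for all players drives the empirical distribution of play to the set of correlated equilibria; profile swap regret is the paper's variant of this quantity for polytope games.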