Goto

Collaborating Authors

 complexity measure


A Rigorous, Tractable Measure of Model Complexity

arXiv.org Machine Learning

One of the most fundamental properties of a machine learning model is its complexity, with applications across topics such as interpretation, generalization, and model selection. Despite its importance, there is no canonical, model-agnostic way to assess a model's complexity. While simple heuristics, such as the number or magnitude of parameters, yield very crude estimates, hyperparameter-based approaches, such as polynomial degree or kernel length scale, do not generalize across model classes. More rigorous methods, including the Vapnik-Chervonenkis dimension (VCD) (Vapnik, 2013), Rademacher complexity (RMC) (Bartlett and Mendelson, 2002), and effective number of parameters (or effective degrees of freedom, ENP) (Efron, 1986), are difficult, or even impossible, to compute in practice, leaving the user to resort to crude bounds and/or approximations. The topic is further complicated by the often overlooked distinction between model and function complexity, where the former sets a ceiling on the latter.


On the Complexity of Adversarial Decision Making

Neural Information Processing Systems

A central problem in online learning and decision making--from bandits to reinforcement learning--is to understand what modeling assumptions lead to sampleefficient learning guarantees. We consider a general adversarial decision making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result is to show--via new upper and lower bounds--that the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. [17] in the stochastic counterpart to our setting, is necessary and sufficient to obtain low regret for adversarial decision making. However, compared to the stochastic setting, one must apply the Decision-Estimation Coefficient to the convex hull of the class of models (or, hypotheses) under consideration. This establishes that the price of accommodating adversarial rewards or dynamics is governed by the behavior of the model class under convexification, and recovers a number of existing results--both positive and negative. En route to obtaining these guarantees, we provide new structural results that connect the Decision-Estimation Coefficient to variants of other well-known complexity measures, including the Information Ratio of Russo and Van Roy [47] and the Exploration-by-Optimization objective of Lattimore and György [32].


Measuring Generalization with Optimal Transport

Neural Information Processing Systems

Understanding the generalization of deep neural networks is one of the most important tasks in deep learning. Although much progress has been made, theoretical error bounds still often behave disparately from empirical observations. In this work, we develop margin-based generalization bounds, where the margins are normalized with optimal transport costs between independent random subsets sampled from the training distribution. In particular, the optimal transport cost can be interpreted as a generalization of variance which captures the structural properties of the learned feature space. Our bounds robustly predict the generalization error, given training data and network parameters, on large scale datasets. Theoretically, we demonstrate that the concentration and separation of features play crucial roles in generalization, supporting empirical results in the literature.


Assouad, Fano, and Le Cam with Interaction: A Unifying Lower Bound Framework and Characterization for Bandit Learnability

Neural Information Processing Systems

We develop a unifying framework for information-theoretic lower bound in statistical estimation and interactive decision making. Classical lower bound techniques---such as Fano's method, Le Cam's method, and Assouad's lemma---are central to the study of minimax risk in statistical estimation, yet are insufficient to provide tight lower bounds for \emph{interactive decision making} algorithms that collect data interactively (e.g., algorithms for bandits and reinforcement learning). Recent work of Foster et al. provides minimax lower bounds for interactive decision making using seemingly different analysis techniques from the classical methods. These results---which are proven using a complexity measure known as the \emph{Decision-Estimation Coefficient} (DEC)---capture difficulties unique to interactive learning, yet do not recover the tightest known lower bounds for passive estimation. We propose a unified view of these distinct methodologies through a new lower bound approach called \emph{interactive Fano method}. As an application, we introduce a novel complexity measure, the \emph{Fractional Covering Number}, which facilitates the new lower bounds for interactive decision making that extend the DEC methodology by incorporating the complexity of estimation. Using the fractional covering number, we (i) provide a unified characterization of learnability for \emph{any} stochastic bandit problem, (ii) close the remaining gap between the upper and lower bounds in Foster et al. (up to polynomial factors) for any interactive decision making problem in which the underlying model class is convex.


Sequential Probability Assignment with Contexts: Minimax Regret, Contextual Shtarkov Sums, and Contextual Normalized Maximum Likelihood

Neural Information Processing Systems

We study the fundamental problem of sequential probability assignment, also known as online learning with logarithmic loss, with respect to an arbitrary, possibly nonparametric hypothesis class. Our goal is to obtain a complexity measure for the hypothesis class that characterizes the minimax regret and to determine a general, minimax optimal algorithm. Notably, the sequential $\ell_{\infty}$ entropy, extensively studied in the literature (Rakhlin and Sridharan, 2015, Bilodeau et al., 2020, Wu et al., 2023), was shown to not characterize minimax regret in general. Inspired by the seminal work of Shtarkov (1987) and Rakhlin, Sridharan, and Tewari (2010), we introduce a novel complexity measure, the \emph{contextual Shtarkov sum}, corresponding to the Shtarkov sum after projection onto a multiary context tree, and show that the worst case log contextual Shtarkov sum equals the minimax regret. Using the contextual Shtarkov sum, we derive the minimax optimal strategy, dubbed \emph{contextual Normalized Maximum Likelihood} (cNML). Our results hold for sequential experts, beyond binary labels, which are settings rarely considered in prior work. To illustrate the utility of this characterization, we provide a short proof of a new regret upper bound in terms of sequential $\ell_{\infty}$ entropy, unifying and sharpening state-of-the-art bounds by Bilodeau et al. (2020) and Wu et al. (2023).


Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms

Neural Information Processing Systems

We present a novel set of rigorous and computationally efficient topology-based complexity notions that exhibit a strong correlation with the generalization gap in modern deep neural networks (DNNs). DNNs show remarkable generalization properties, yet the source of these capabilities remains elusive, defying the established statistical learning theory. Recent studies have revealed that properties of training trajectories can be indicative of generalization. Building on this insight, state-of-the-art methods have leveraged the topology of these trajectories, particularly their fractal dimension, to quantify generalization. Most existing works compute this quantity by assuming continuous-or infinite-time training dynamics, complicating the development of practical estimators capable of accurately predicting generalization without access to test data.


Lifelong Learning with Weighted Majority Votes

Neural Information Processing Systems

Better understanding of the potential benefits of information transfer and representation learning is an important step towards the goal of building intelligent systems that are able to persist in the world and learn over time. In this work, we consider a setting where the learner encounters a stream of tasks but is able to retain only limited information from each encountered task, such as a learned predictor. In contrast to most previous works analyzing this scenario, we do not make any distributional assumptions on the task generating process. Instead, we formulate a complexity measure that captures the diversity of the observed tasks. We provide a lifelong learning algorithm with error guarantees for every observed task (rather than on average). We show sample complexity reductions in comparison to solving every task in isolation in terms of our task complexity measure. Further, our algorithmic framework can naturally be viewed as learning a representation from encountered tasks with a neural network.