Computational Learning Theory
Revisiting complexity and the bias-variance tradeoff
Dwivedi, Raaz, Singh, Chandan, Yu, Bin, Wainwright, Martin J.
The recent success of high-dimensional models, such as deep neural networks (DNNs), has led many to question the validity of the bias-variance tradeoff principle in high dimensions. We reexamine it with respect to two key choices: the model class and the complexity measure. We argue that failing to suitably specify either one can falsely suggest that the tradeoff does not hold. This observation motivates us to seek a valid complexity measure, defined with respect to a reasonably good class of models. Building on Rissanen's principle of minimum description length (MDL), we propose a novel MDL-based complexity (MDL-COMP). We focus on the context of linear models, which have been recently used as a stylized tractable approximation to DNNs in high-dimensions. MDL-COMP is defined via an optimality criterion over the encodings induced by a good Ridge estimator class. We derive closed-form expressions for MDL-COMP and show that for a dataset with $n$ observations and $d$ parameters it is \emph{not always} equal to $d/n$, and is a function of the singular values of the design matrix and the signal-to-noise ratio. For random Gaussian design, we find that while MDL-COMP scales linearly with $d$ in low-dimensions ($d
Smoothed Analysis of Online and Differentially Private Learning
Haghtalab, Nika, Roughgarden, Tim, Shetty, Abhishek
Robustness to changes in the data and protecting the privacy of data are two of the main challenges faced by machine learning and have led to the design of online and differentially private learning algorithms. While offline PAC learnability is characterized by the finiteness of VC dimension, online and differentially private learnability are both characterized by the finiteness of the Littlestone dimension [Alon et al., 2019, Ben-David et al., 2009, Bun et al., 2020]. This latter characterization is often interpreted as an impossibility result for achieving robustness and privacy on worst-case instances, especially in classification where even simple hypothesis classes such as 1-dimensional thresholds have constant VC dimension but infinite Littlestone dimension. Impossibility results for worst-case adversaries do not invalidate the original goals of robust and private learning with respect to practically relevant hypothesis classes; rather, they indicate that a new model is required to provide rigorous guidance on the design of online and differentially private learning algorithms. In this work, we go beyond worst-case analysis and design online learning algorithms and differentially private learning algorithms as good as their offline and non-private PAC learning counterparts in a realistic semi-random model of data. Inspired by smoothed analysis [Spielman and Teng, 2004], we introduce frameworks for online and differentially private learning in which adversarially chosen inputs are perturbed slightly by nature (reflecting, e.g., measurement errors or uncertainty). Equivalently, we consider an adversary restricted to choose an input distribution that is not overly concentrated, with the realized input then drawn from the adversarys chosen distribution. Our goal is to design algorithms with good expected regret and error bounds, where the expectation is over natures perturbations (and any random coin flips of the algorithm). Our positive results show, in a precise sense, that the known lower bounds for worst-case online and differentially private learnability are fundamentally brittle.
Discovering outstanding subgroup lists for numeric targets using MDL
Proenรงa, Hugo M., Grรผnwald, Peter, Bรคck, Thomas, van Leeuwen, Matthijs
The task of subgroup discovery (SD) is to find interpretable descriptions of subsets of a dataset that stand out with respect to a target attribute. To address the problem of mining large numbers of redundant subgroups, subgroup set discovery (SSD) has been proposed. State-of-the-art SSD methods have their limitations though, as they typically heavily rely on heuristics and/or user-chosen hyperparameters. We propose a dispersion-aware problem formulation for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists. We argue that the best subgroup list is the one that best summarizes the data given the overall distribution of the target. We restrict our focus to a single numeric target variable and show that our formalization coincides with an existing quality measure when finding a single subgroup, but that-in addition-it allows to trade off subgroup quality with the complexity of the subgroup. We next propose SSD++, a heuristic algorithm for which we empirically demonstrate that it returns outstanding subgroup lists: non-redundant sets of compact subgroups that stand out by having strongly deviating means and small spread.
Estimates on Learning Rates for Multi-Penalty Distribution Regression
This paper is concerned with functional learning by utilizing two-stage sampled distribution regression. We study a multi-penalty regularization algorithm for distribution regression under the framework of learning theory. The algorithm aims at regressing to real valued outputs from probability measures. The theoretical analysis on distribution regression is far from maturity and quite challenging, since only second stage samples are observable in practical setting. In the algorithm, to transform information from samples, we embed the distributions to a reproducing kernel Hilbert space $\mathcal{H}_K$ associated with Mercer kernel $K$ via mean embedding technique. The main contribution of the paper is to present a novel multi-penalty regularization algorithm to capture more features of distribution regression and derive optimal learning rates for the algorithm. The work also derives learning rates for distribution regression in the nonstandard setting $f_{\rho}\notin\mathcal{H}_K$, which is not explored in existing literature. Moreover, we propose a distribution regression-based distributed learning algorithm to face large-scale data or information challenge. The optimal learning rates are derived for the distributed learning algorithm. By providing new algorithms and showing their learning rates, we improve the existing work in different aspects in the literature.
Categorical anomaly detection in heterogeneous data using minimum description length clustering
Cheney, James, Gombau, Xavier, Berrada, Ghita, Benabderrahmane, Sidahmed
Two examples of anomaly detection based on MDL have been been proposed for categorical data based on the minimum description studied and shown to perform well: the OC3 algorithm [21] based length (MDL) principle. However, they can be ineffective when on an itemset mining technique called Krimp [26], and the CompreX detecting anomalies in heterogeneous datasets representing a mixture algorithm [2]. Broadly speaking, both take a similar approach: of different sources, such as security scenarios in which system first, a model H of the data that compresses it well is found using a and user processes have distinct behavior patterns. We propose a heuristic search, balancing the model complexity L(H) (number of meta-algorithm for enhancing any MDL-based anomaly detection bits required to compress the model structure/parameters) against model to deal with heterogeneous data by fitting a mixture model the data complexity L(X H) (number of bits required to compress to the data, via a variant of k-means clustering. Our experimental the data given the model). Once such a model H is found, we assign results show that using a discrete mixture model provides competitive to each object x X a score corresponding to the object's performance relative to two previous anomaly detection compressed size L(x H) given the selected model. Intuitively, if the algorithms, while mixtures of more sophisticated models yield further model accurately characterizes the data as a whole, records that are gains, on both synthetic datasets and realistic datasets from a representative will compress well, yielding a low anomaly score, security scenario.
Non-Convex SGD Learns Halfspaces with Adversarial Label Noise
Diakonikolas, Ilias, Kontonis, Vasilis, Tzamos, Christos, Zarifis, Nikos
Learning in the presence of noisy data is a central challenge in machine learning. In this work, we study the efficient learnability of halfspaces when a fraction of the training labels is adversarially corrupted. As our main contribution, we show that non-convex SGD efficiently learns homogeneous halfspaces in the presence of adversarial label noise with respect to a broad family of well-behaved distributions, including log-concave distributions. Before we state our contributions, we provide some background and motivation for this work.
Learning Halfspaces with Tsybakov Noise
Diakonikolas, Ilias, Kontonis, Vasilis, Tzamos, Christos, Zarifis, Nikos
We study the efficient PAC learnability of halfspaces in the presence of Tsybakov noise. In the Tsybakov noise model, each label is independently flipped with some probability which is controlled by an adversary. This noise model significantly generalizes the Massart noise model, by allowing the flipping probabilities to be arbitrarily close to $1/2$ for a fraction of the samples. Our main result is the first non-trivial PAC learning algorithm for this problem under a broad family of structured distributions -- satisfying certain concentration and (anti-)anti-concentration properties -- including log-concave distributions. Specifically, we given an algorithm that achieves misclassification error $\epsilon$ with respect to the true halfspace, with quasi-polynomial runtime dependence in $1/\epsilin$. The only previous upper bound for this problem -- even for the special case of log-concave distributions -- was doubly exponential in $1/\epsilon$ (and follows via the naive reduction to agnostic learning). Our approach relies on a novel computationally efficient procedure to certify whether a candidate solution is near-optimal, based on semi-definite programming. We use this certificate procedure as a black-box and turn it into an efficient learning algorithm by searching over the space of halfspaces via online convex optimization.
Optimal Continual Learning has Perfect Memory and is NP-hard
Knoblauch, Jeremias, Husain, Hisham, Diethe, Tom
Continual Learning (CL) algorithms incrementally learn a predictor or representation across multiple sequentially observed tasks. Designing CL algorithms that perform reliably and avoid so-called catastrophic forgetting has proven a persistent challenge. The current paper develops a theoretical approach that explains why. In particular, we derive the computational properties which CL algorithms would have to possess in order to avoid catastrophic forgetting. Our main finding is that such optimal CL algorithms generally solve an NP-hard problem and will require perfect memory to do so. The findings are of theoretical interest, but also explain the excellent performance of CL algorithms using experience replay, episodic memory and core sets relative to regularization-based approaches.
Faster PAC Learning and Smaller Coresets via Smoothed Analysis
Maalouf, Alaa, Jubran, Ibrahim, Tukan, Murad, Feldman, Dan
PAC-learning usually aims to compute a small subset ($\varepsilon$-sample/net) from $n$ items, that provably approximates a given loss function for every query (model, classifier, hypothesis) from a given set of queries, up to an additive error $\varepsilon\in(0,1)$. Coresets generalize this idea to support multiplicative error $1\pm\varepsilon$. Inspired by smoothed analysis, we suggest a natural generalization: approximate the \emph{average} (instead of the worst-case) error over the queries, in the hope of getting smaller subsets. The dependency between errors of different queries implies that we may no longer apply the Chernoff-Hoeffding inequality for a fixed query, and then use the VC-dimension or union bound. This paper provides deterministic and randomized algorithms for computing such coresets and $\varepsilon$-samples of size independent of $n$, for any finite set of queries and loss function. Example applications include new and improved coreset constructions for e.g. streaming vector summarization [ICML'17] and $k$-PCA [NIPS'16]. Experimental results with open source code are provided.
Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Connections to Evolvability
Chen, Sitan, Koehler, Frederic, Moitra, Ankur, Yau, Morris
In this paper we revisit some classic problems on classification under misspecification. In particular, we study the problem of learning halfspaces under Massart noise with rate $\eta$. In a recent work, Diakonikolas, Goulekakis, and Tzamos resolved a long-standing problem by giving the first efficient algorithm for learning to accuracy $\eta + \epsilon$ for any $\epsilon > 0$. However, their algorithm outputs a complicated hypothesis, which partitions space into $\text{poly}(d,1/\epsilon)$ regions. Here we give a much simpler algorithm and in the process resolve a number of outstanding open questions: (1) We give the first proper learner for Massart halfspaces that achieves $\eta + \epsilon$. We also give improved bounds on the sample complexity achievable by polynomial time algorithms. (2) Based on (1), we develop a blackbox knowledge distillation procedure to convert an arbitrarily complex classifier to an equally good proper classifier. (3) By leveraging a simple but overlooked connection to evolvability, we show any SQ algorithm requires super-polynomially many queries to achieve $\mathsf{OPT} + \epsilon$. Moreover we study generalized linear models where $\mathbb{E}[Y|\mathbf{X}] = \sigma(\langle \mathbf{w}^*, \mathbf{X}\rangle)$ for any odd, monotone, and Lipschitz function $\sigma$. This family includes the previously mentioned halfspace models as a special case, but is much richer and includes other fundamental models like logistic regression. We introduce a challenging new corruption model that generalizes Massart noise, and give a general algorithm for learning in this setting. Our algorithms are based on a small set of core recipes for learning to classify in the presence of misspecification. Finally we study our algorithm for learning halfspaces under Massart noise empirically and find that it exhibits some appealing fairness properties.