Computational Learning Theory

Globally Optimal Learning for Structured Elliptical Losses

Neural Information Processing Systems

Heavy tailed and contaminated data are common in various applications of machine learning. A standard technique to handle regression tasks that involve such data, is to use robust losses, e.g., the popular Huber's loss. In structured problems, however, where there are multiple labels and structural constraints on the labels are imposed (or learned), robust optimization is challenging, and more often than not the loss used is simply the negative log-likelihood of a Gaussian Markov random field. Heavy tailed and contaminated data are common in various applications of machine learning. A standard technique to handle regression tasks that involve such data, is to use robust losses, e.g., the popular Huber's loss.

On the Hardness of Robust Classification

Neural Information Processing Systems

It is becoming increasingly important to understand the vulnerability of machine learning models to adversarial attacks. In this paper we study the feasibility of robust learning from the perspective of computational learning theory, considering both sample and computational complexity. In particular, our definition of robust learnability requires polynomial sample complexity. We start with two negative results. We show that no non-trivial concept class can be robustly learned in the distribution-free setting against an adversary who can perturb just a single input bit.

Graph-based Discriminators: Sample Complexity and Expressiveness

Neural Information Processing Systems

A basic question in learning theory is to identify if two distributions are identical when we have access only to examples sampled from the distributions. This basic task is considered, for example, in the context of Generative Adversarial Networks (GANs), where a discriminator is trained to distinguish between a real-life distribution and a synthetic distribution. Classically, we use a hypothesis class $H$ and claim that the two distributions are distinct if for some $h\in H$ the expected value on the two distributions is (significantly) different. Our starting point is the following fundamental problem: "is having the hypothesis dependent on more than a single random example beneficial". To address this challenge we define $k$-ary based discriminators, which have a family of Boolean $k$-ary functions $\G$.

Distribution-Independent PAC Learning of Halfspaces with Massart Noise

Neural Information Processing Systems

We study the problem of {\em distribution-independent} PAC learning of halfspaces in the presence of Massart noise. Specifically, we are given a set of labeled examples $(\bx, y)$ drawn from a distribution $\D$ on $\R {d 1}$ such that the marginal distribution on the unlabeled points $\bx$ is arbitrary and the labels $y$ are generated by an unknown halfspace corrupted with Massart noise at noise rate $\eta 1/2$. We give a $\poly\left(d, 1/\eps\right)$ time algorithm for this problem with misclassification error $\eta \eps$. We also provide evidence that improving on the error guarantee of our algorithm might be computationally hard. Prior to our work, no efficient weak (distribution-independent) learner was known in this model, even for the class of disjunctions.

Generalized Policy Elimination: an efficient algorithm for Nonparametric Contextual Bandits Machine Learning

We propose the Generalized Policy Elimination (GPE) algorithm, an oracle-efficient contextual bandit (CB) algorithm inspired by the Policy Elimination algorithm of \cite{dudik2011}. We prove the first regret optimality guarantee theorem for an oracle-efficient CB algorithm competing against a nonparametric class with infinite VC-dimension. Specifically, we show that GPE is regret-optimal (up to logarithmic factors) for policy classes with integrable entropy. For classes with larger entropy, we show that the core techniques used to analyze GPE can be used to design an $\varepsilon$-greedy algorithm with regret bound matching that of the best algorithms to date. We illustrate the applicability of our algorithms and theorems with examples of large nonparametric policy classes, for which the relevant optimization oracles can be efficiently implemented.

Demystifying the world of deep networks


Introductory statistics courses teach us that, when fitting a model to some data, we should have more data than free parameters to avoid the danger of overfitting -- fitting noisy data too closely, and thereby failing to fit new data. It is surprising, then, that in modern deep learning the practice is to have orders of magnitude more parameters than data. Despite this, deep networks show good predictive performance, and in fact do better the more parameters they have. It has been known for some time that good performance in machine learning comes from controlling the complexity of networks, which is not just a simple function of the number of free parameters. The complexity of a classifier, such as a neural network, depends on measuring the "size" of the space of functions that this network represents, with multiple technical measures previously suggested: Vapnik–Chervonenkis dimension, covering numbers, or Rademacher complexity, to name a few.

Decidability of Sample Complexity of PAC Learning in finite setting Machine Learning

In this short note we observe that the sample complexity of PAC machine learning of various concepts, including learning the maximum (EMX), can be exactly determined when the support of the probability measures considered as models satisfies an a-priori bound. This result contrasts with the recently discovered undecidability of EMX within ZFC for finitely supported probabilities (with no a priori bound). Unfortunately, the decision procedure is at present, at least doubly exponential in the number of points times the uniform bound on the support size.

A General Method for Robust Learning from Batches Machine Learning

In many applications, data is collected in batches, some of which are corrupt or even adversarial. Recent work derived optimal robust algorithms for estimating discrete distributions in this setting. We consider a general framework of robust learning from batches, and determine the limits of both classification and distribution estimation over arbitrary, including continuous, domains. Building on these results, we derive the first robust agnostic computationally-efficient learning algorithms for piecewise-interval classification, and for piecewise-polynomial, monotone, log-concave, and gaussian-mixture distribution estimation.

On the Sample Complexity of Adversarial Multi-Source PAC Learning Machine Learning

We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms. Specifically, we analyze the situation in which a learning system obtains datasets from multiple sources, some of which might be biased or even adversarially perturbed. It is known that in the single-source case, an adversary with the power to corrupt a fixed fraction of the training data can prevent PAC-learnability, that is, even in the limit of infinitely much training data, no learning system can approach the optimal test error. In this work we show that, surprisingly, the same is not true in the multi-source setting, where the adversary can arbitrarily corrupt a fixed fraction of the data sources. Our main results are a generalization bound that provides finite-sample guarantees for this learning setting, as well as corresponding lower bounds. Besides establishing PAC-learnability our results also show that in a cooperative learning setting sharing data with other parties has provable benefits, even if some participants are malicious.

Theory of Optimal Learning Machines


Matteo Marsili from The Abdus Salam International Centre for Theoretical Physics with lecture titleTheory of Optimal Learning Machines is now publicy available.