Computational Learning Theory
An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples
ML builds heavily on statistics. For example, when we train our machine to learn, we have to give it a statistically significant random sample as training data. If the training set is not random, we run the risk of the machine picking up patterns that aren't actually there. And if the training set is too small (see the law of large numbers), we won't learn enough and may even reach inaccurate conclusions. For example, attempting to predict company-wide satisfaction patterns based on data from upper management alone would likely be error-prone.
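As a toy illustration of both points (sample size and sampling bias), the sketch below simulates satisfaction scores for a hypothetical company where upper management is assumed, purely for illustration, to be more satisfied than everyone else; the group sizes and score distributions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 950 regular employees, 50 upper managers.
# Managers are assumed (for illustration only) to be more satisfied on average.
employees = rng.normal(loc=6.0, scale=1.5, size=950)   # satisfaction on a 0-10 scale
managers = rng.normal(loc=8.5, scale=0.8, size=50)
population = np.concatenate([employees, managers])

print("true company-wide mean:", round(population.mean(), 2))

# Biased sample: upper management only -> systematically overestimates satisfaction.
print("management-only estimate:", round(managers.mean(), 2))

# Random samples: larger samples concentrate around the true mean (law of large numbers).
for n in (10, 100, 500):
    sample = rng.choice(population, size=n, replace=False)
    print(f"random sample of {n:3d}:", round(sample.mean(), 2))
```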
How To Become A Machine Learning Expert In One Simple Step -- Swan Intelligence
The web is full of good explanations of machine learning algorithms, and every second applicant for a data science position has finished the Coursera course on machine learning. But that theory will not help you choose good values for the 16 parameters a standard implementation of a random forest takes. The default values are good to get started, but which parameters should you modify, and how, depending on your data? Choosing the right features, algorithms and parameters is an art.
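To make that point concrete, here is a minimal sketch using scikit-learn (which the post itself does not mention) of the handful of random-forest parameters that are usually worth tuning first, with everything else left at its defaults; the parameter grid is only an illustrative guess, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for "your data".
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Of the many constructor parameters, a few tend to matter most in practice:
# number of trees, tree depth, features considered per split, and leaf size.
grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Which values actually win depends entirely on the data, which is exactly why the defaults are only a starting point.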
An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples
Machine Learning (ML) is coming into its own, with a growing recognition that ML can play a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. ML provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization. The supply of able ML designers has yet to catch up to this demand. A major reason for this is that ML is just plain tricky. This tutorial introduces the basics of Machine Learning theory, laying down the common themes and concepts, making it easy to follow the logic and get comfortable with the topic. So what exactly is "machine learning" anyway?
Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns
Bertens, Roel, Vreeken, Jilles, Siebes, Arno
We study how to obtain concise descriptions of discrete multivariate sequential data. In particular, we study how to do so in terms of rich multivariate sequential patterns that can capture potentially highly interesting (cor)relations between sequences. To this end we allow our pattern language to span over the domains (alphabets) of all sequences, allow patterns to overlap temporally, as well as allow for gaps in their occurrences. We formalise our goal by the Minimum Description Length principle, by which our objective is to discover the set of patterns that provides the most succinct description of the data. To discover high-quality pattern sets directly from data, we introduce DITTO, a highly efficient algorithm that approximates the ideal result very well. Experiments show that DITTO correctly discovers the patterns planted in synthetic data. Moreover, it scales favourably with the length of the data, the number of attributes, and the alphabet sizes. On real data, ranging from sensor networks to annotated text, DITTO discovers easily interpretable summaries that provide clear insight into both the univariate and multivariate structure.
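DITTO itself is not reproduced here, but the MDL objective can be illustrated with a deliberately crude two-part code: model cost plus data cost, where adding a pattern to the "code table" pays off only if it shortens the encoded data by more than it costs to describe. The encoding below is my own toy scheme, far simpler than the paper's, and only for univariate sequences.

```python
import math
from collections import Counter

def shannon_bits(counts):
    """Total Shannon code length (in bits) for a stream with the given symbol counts."""
    total = sum(counts.values())
    return sum(c * -math.log2(c / total) for c in counts.values())

def two_part_cost(sequence, pattern=None):
    """Crude two-part MDL cost: model bits + data bits.
    If a pattern is given, its non-overlapping occurrences are replaced by one symbol."""
    seq = list(sequence)
    model_bits = 0.0
    if pattern:
        # Charge the model for spelling out the pattern (uniform code over the alphabet).
        alphabet = set(sequence)
        model_bits = len(pattern) * math.log2(max(len(alphabet), 2))
        # Greedily replace occurrences of the pattern with a single placeholder symbol.
        out, i, p = [], 0, list(pattern)
        while i < len(seq):
            if seq[i:i + len(p)] == p:
                out.append("<P>")
                i += len(p)
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return model_bits + shannon_bits(Counter(seq))

data = "abcabcabcxyabcab"
print("cost without pattern:", round(two_part_cost(data), 1))
print("cost with 'abc' in the code table:", round(two_part_cost(data, "abc"), 1))
```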
Minimax Lower Bounds for Realizable Transductive Classification
Tolstikhin, Ilya, Lopez-Paz, David
Transductive learning considers a training set of $m$ labeled samples and a test set of $u$ unlabeled samples, with the goal of best labeling that particular test set. In contrast, inductive learning considers a training set of $m$ labeled samples drawn iid from $P(X,Y)$, with the goal of best labeling any future samples drawn iid from $P(X)$. This comparison suggests that transduction is a much easier type of inference than induction, but is this really the case? This paper provides a negative answer to this question, by proving the first known minimax lower bounds for transductive, realizable, binary classification. Our lower bounds show that $m$ should be at least $\Omega(d/\epsilon + \log(1/\delta)/\epsilon)$ when $\epsilon$-learning a concept class $\mathcal{H}$ of finite VC-dimension $d<\infty$ with confidence $1-\delta$, for all $m \leq u$. From this result we draw three important conclusions. First, general transduction is as hard as general induction, since both problems have $\Omega(d/m)$ minimax values. Second, the use of unlabeled data does not help general transduction, since supervised learning algorithms such as ERM and (Hanneke, 2015) match our transductive lower bounds while ignoring the unlabeled test set. Third, our transductive lower bounds imply lower bounds for semi-supervised learning, which add to the important discussion about the role of unlabeled data in machine learning.
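A minimal sketch of the transductive setting, using a toy 1D threshold class of my own choosing rather than anything from the paper: ERM is fit on the $m$ labeled points and then simply applied to the fixed set of $u$ unlabeled test points, ignoring them during training, which is exactly the kind of algorithm the lower bound says cannot be beaten in general.

```python
import numpy as np

rng = np.random.default_rng(0)

# Concept class: 1D thresholds h_t(x) = 1[x >= t]; the data is realizable by construction.
true_t = 0.3
m, u = 20, 50                          # labeled training size, unlabeled test size
X_train = rng.uniform(0, 1, m)
y_train = (X_train >= true_t).astype(int)
X_test = rng.uniform(0, 1, u)          # transduction: we only care about labeling these points

# ERM over a grid of candidate thresholds (it never looks at the unlabeled test set,
# mirroring the abstract's point that ERM already matches the transductive lower bound).
candidates = np.linspace(0, 1, 1001)
train_err = [(np.mean((X_train >= t).astype(int) != y_train), t) for t in candidates]
best_t = min(train_err)[1]

pred_labels = (X_test >= best_t).astype(int)
true_labels = (X_test >= true_t).astype(int)
print("transductive error on this particular test set:", np.mean(pred_labels != true_labels))
```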
The Optimal Sample Complexity of PAC Learning
This work establishes a new upper bound on the number of samples sufficient for PAC learning in the realizable case. The bound matches known lower bounds up to numerical constant factors. This solves a long-standing open problem on the sample complexity of PAC learning. The technique and analysis build on a recent breakthrough by Hans Simon.
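For intuition about what "matches known lower bounds up to numerical constant factors" means, the snippet below compares the classic realizable-case bound, which carries an extra $\log(1/\epsilon)$ factor, with the optimal $\Theta((d + \log(1/\delta))/\epsilon)$ rate. Constants are dropped, so the numbers only show the scaling, not actual sample sizes.

```python
import math

def classic_pac_bound(d, eps, delta):
    """Classic realizable-case VC bound, up to constants: (d*log(1/eps) + log(1/delta)) / eps."""
    return (d * math.log(1 / eps) + math.log(1 / delta)) / eps

def optimal_pac_bound(d, eps, delta):
    """Optimal rate matching the lower bound, up to constants: (d + log(1/delta)) / eps."""
    return (d + math.log(1 / delta)) / eps

# The gap between the two grows like log(1/eps) as the accuracy requirement tightens.
for eps in (0.1, 0.01, 0.001):
    print(eps, round(classic_pac_bound(10, eps, 0.05)), round(optimal_pac_bound(10, eps, 0.05)))
```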
On the Pseudo-Dimension of Nearly Optimal Auctions
Morgenstern, Jamie H., Roughgarden, Tim
This paper develops a general approach, rooted in statistical learning theory, to learning an approximately revenue-maximizing auction from data. We introduce t-level auctions to interpolate between simple auctions, such as welfare maximization with reserve prices, and optimal auctions, thereby balancing the competing demands of expressivity and simplicity. We prove that such auctions have small representation error, in the sense that for every product distribution F over bidders' valuations, there exists a t-level auction with small t and expected revenue close to optimal. We show that the set of t-level auctions has modest pseudo-dimension (for polynomial t) and therefore leads to small learning error. One consequence of our results is that, in arbitrary single-parameter settings, one can learn a mechanism with expected revenue arbitrarily close to optimal from a polynomial number of samples.
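The simplest member of the family discussed here, a second-price auction with a single reserve price, can already be learned from valuation samples in a few lines. The sketch below picks the empirically best reserve on simulated exponential valuations, a setting I chose for illustration where the optimal reserve is known to be 1; it is not the paper's t-level construction.

```python
import numpy as np

rng = np.random.default_rng(1)

def revenue_with_reserve(valuations, r):
    """Revenue of a single-item second-price auction with reserve price r."""
    v = np.sort(valuations)[::-1]
    if v[0] < r:
        return 0.0                                   # no sale: highest bid below reserve
    return max(v[1], r) if len(v) > 1 else r         # winner pays max(second bid, reserve)

def estimate_best_reserve(samples, candidates):
    """Pick the reserve with the highest average revenue on the sampled valuation profiles."""
    avg_rev = [np.mean([revenue_with_reserve(s, r) for s in samples]) for r in candidates]
    return candidates[int(np.argmax(avg_rev))]

samples = [rng.exponential(1.0, size=2) for _ in range(2000)]   # 2 bidders, Exp(1) valuations
candidates = np.linspace(0, 3, 61)
print("empirically best reserve:", round(float(estimate_best_reserve(samples, candidates)), 2))
```

With enough samples the estimate concentrates near the true optimum, which is the flavor of guarantee the pseudo-dimension bound makes precise.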
Sum-of-Squares Lower Bounds for Sparse PCA
This paper establishes a statistical versus computational trade-off for solving a basic high-dimensional machine learning problem via a basic convex relaxation method. Specifically, we consider the {\em Sparse Principal Component Analysis} (Sparse PCA) problem, and the family of {\em Sum-of-Squares} (SoS, aka Lasserre/Parrilo) convex relaxations. It was well known that in large dimension $p$, a planted $k$-sparse unit vector can be {\em in principle} detected using only $n \approx k\log p$ (Gaussian or Bernoulli) samples, but all {\em efficient} (polynomial time) algorithms known require $n \approx k^2$ samples. It was also known that this quadratic gap cannot be improved by the most basic {\em semi-definite} (SDP, aka spectral) relaxation, equivalent to degree-2 SoS algorithms. Here we prove that degree-4 SoS algorithms also cannot improve this quadratic gap. This average-case lower bound adds to the small collection of hardness results in machine learning for this powerful family of convex relaxation algorithms. Moreover, our design of moments (or ``pseudo-expectations'') for this lower bound is quite different from previous lower bounds. Establishing lower bounds for higher-degree SoS algorithms remains a challenging problem.
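To make the planted model concrete, the snippet below draws samples from the spiked covariance $N(0, I + \beta vv^T)$ with a $k$-sparse unit vector $v$ and runs diagonal thresholding, the kind of simple polynomial-time detector that is known to need on the order of $k^2$ samples. It is only a toy baseline of my own and has nothing to do with the SoS machinery analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

p, k, n, beta = 500, 10, 500, 4.0      # dimension, sparsity, samples, signal strength

# Planted k-sparse spike: x_i = sqrt(beta) * g_i * v + z_i, so Cov(x) = I + beta * v v^T.
support = rng.choice(p, size=k, replace=False)
v = np.zeros(p)
v[support] = 1 / np.sqrt(k)
X = rng.standard_normal((n, p)) + np.sqrt(beta) * rng.standard_normal((n, 1)) * v

# Diagonal thresholding: pick the k coordinates with the largest empirical variance,
# i.e. the simple efficient detector, not an information-theoretically optimal one.
diag = np.var(X, axis=0)
recovered = set(np.argsort(diag)[-k:])
print("support recovery rate:", len(recovered & set(support)) / k)
```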
Spectral Norm Regularization of Orthonormal Representations for Graph Transduction
Shivanna, Rakesh, Chatterjee, Bibaswan K., Sankaran, Raman, Bhattacharyya, Chiranjib, Bach, Francis
Recent literature~\cite{ando} suggests that embedding a graph on a unit sphere leads to better generalization for graph transduction. However, the choice of optimal embedding and an efficient algorithm to compute the same remain open. In this paper, we show that orthonormal representations, a class of unit-sphere graph embeddings, are PAC learnable. Existing PAC-based analyses do not apply, as the VC dimension of the function class is infinite. We propose an alternative PAC-based bound, which does not depend on the VC dimension of the underlying function class but is related to the famous Lov\'{a}sz~$\vartheta$ function. The main contribution of the paper is SPORE, a SPectral regularized ORthonormal Embedding for graph transduction, derived from the PAC bound. SPORE is posed as a non-smooth convex function over an \emph{elliptope}. Such problems are usually solved as semi-definite programs (SDPs) with time complexity $O(n^6)$. We present Infeasible Inexact Proximal~(IIP): an inexact proximal method which performs a subgradient procedure on an approximate projection, not necessarily feasible. IIP is more scalable than SDP, has an $O(\frac{1}{\sqrt{T}})$ convergence rate, and is generally applicable whenever a suitable approximate projection is available. We use IIP to compute SPORE, where the approximate projection step is computed by FISTA, an accelerated gradient descent procedure. We show that the method has a convergence rate of $O(\frac{1}{\sqrt{T}})$. The proposed algorithm easily scales to thousands of vertices, while the standard SDP computation does not scale beyond a few hundred vertices. Furthermore, the analysis presented here easily extends to the multiple graph setting.
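The SPORE objective and the IIP solver are not reproduced here; the sketch below only illustrates the general idea of running subgradient steps with an approximate (possibly infeasible) projection, on a deliberately simple toy problem of my own choosing rather than the elliptope-constrained problem from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy problem: minimise the non-smooth convex f(x) = ||x - c||_1 over
# C = {x : ||x||_2 <= 1, sum(x) = 1}.  The projection onto C is computed only
# approximately, by a few alternating-projection rounds, so iterates need not be
# exactly feasible -- the flavor of inexact-projection schemes (NOT the paper's IIP).
n = 50
c = rng.standard_normal(n)

def subgrad(x):
    return np.sign(x - c)                        # a subgradient of ||x - c||_1 at x

def approx_project(x, rounds=3):
    for _ in range(rounds):
        x = x - (x.sum() - 1.0) / len(x)         # project onto the hyperplane sum(x) = 1
        norm = np.linalg.norm(x)
        if norm > 1.0:
            x = x / norm                         # project onto the unit Euclidean ball
    return x                                     # approximately (not exactly) in C

x = np.zeros(n)
best = np.inf
for t in range(1, 2001):
    x = approx_project(x - (0.5 / np.sqrt(t)) * subgrad(x))   # O(1/sqrt(T))-style step size
    best = min(best, np.abs(x - c).sum())
print("best objective value found:", round(best, 3))
```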