Computational Learning Theory
The Mathematics of Machine Learning
In the last few months, I have had several people contact me about their enthusiasm for venturing into the world of data science and using Machine Learning (ML) techniques to probe statistical regularities and build impeccable data-driven products. However, I've observed that some of them actually lack the mathematical intuition and framework needed to get useful results, and this is the main reason I decided to write this blog post. Recently, there has been an upsurge in the availability of easy-to-use machine and deep learning packages such as scikit-learn, Weka, TensorFlow, and others. Machine Learning theory is a field that intersects the statistical, probabilistic, computer science, and algorithmic aspects of learning iteratively from data and finding hidden insights that can be used to build intelligent applications. Despite the immense possibilities of machine and deep learning, a thorough mathematical understanding of many of these techniques is necessary for a good grasp of the inner workings of the algorithms and for obtaining good results.
Generalization Analysis for Ranking Using Integral Operator
Liu, Yong (Institute of Information Engineering, Chinese Academy of Sciences) | Liao, Shizhong (Tianjin University) | Lin, Hailun (Institute of Information Engineering, Chinese Academy of Sciences) | Yue, Yinliang (Institute of Information Engineering, Chinese Academy of Sciences) | Wang, Weiping (Institute of Information Engineering, Chinese Academy of Sciences)
The study of the generalization performance of ranking algorithms is one of the fundamental issues in ranking learning theory. Although several generalization bounds have been proposed based on different measures, the convergence rates of the existing bounds are usually at most $O(1/\sqrt{n})$, where $n$ is the size of the data set. In this paper, we derive novel generalization bounds for regularized ranking in a reproducing kernel Hilbert space via the integral operator of the kernel function. We prove that the rates of our bounds are much faster than $O(1/\sqrt{n})$. Specifically, we first introduce a notion of local Rademacher complexity for ranking, called local ranking Rademacher complexity, which is used to measure the complexity of the space of loss functions of the ranking. Then, we use the local ranking Rademacher complexity to obtain a basic generalization bound. Finally, we establish the relationship between the local Rademacher complexity and the eigenvalues of the integral operator, and further derive sharp generalization bounds with faster convergence rates.
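For readers unfamiliar with the localization idea, the following is a rough sketch of a standard localized Rademacher complexity, written here for a generic class of pairwise ranking losses; the paper's precise "local ranking Rademacher complexity" may differ in its details, so treat this only as an orienting definition.

```latex
% Rough sketch only: a standard localized Rademacher complexity, written for a
% class \mathcal{L} of pairwise ranking losses; the paper's exact definition may differ.
\[
  \widehat{\mathfrak{R}}_n(\mathcal{L}; r)
  \;=\;
  \mathbb{E}_{\sigma}\!\left[
      \sup_{\substack{\ell \in \mathcal{L} \\ \mathbb{E}[\ell^2] \le r}}
      \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \sigma_{ij}\, \ell(z_i, z_j)
    \right],
\]
where the $\sigma_{ij}$ are i.i.d. Rademacher ($\pm 1$) variables and $z_1,\dots,z_n$
are the observed examples. Restricting the supremum to the ball
$\mathbb{E}[\ell^2] \le r$ is what makes rates faster than $O(1/\sqrt{n})$ possible.
```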
SAT Competition 2016: Recent Developments
Balyo, Tomas (Karlsruhe Institute of Technology, Karlsruhe, Germany) | Heule, Marijn J. H. (The University of Texas at Austin) | Jarvisalo, Matti (HIIT, Department of Computer Science, University of Helsinki, Finland)
We give an overview of SAT Competition 2016, the 2016 edition of the famous competition for Boolean satisfiability (SAT) solvers with over 20 years of history. A key aim is to point out ``what's hot'' in SAT competitions in 2016, i.e., new developments in the competition series, including new competition tracks and new solver techniques implemented in some of the award-winning solvers.
Entropic Causal Inference
Kocaoglu, Murat (The University of Texas at Austin) | Dimakis, Alexandros G. (The University of Texas at Austin) | Vishwanath, Sriram (The University of Texas at Austin) | Hassibi, Babak (California Institute of Technology)
We consider the problem of identifying the causal direction between two discrete random variables using observational data. Unlike previous work, we keep the most general functional model but make an assumption on the unobserved exogenous variable: inspired by Occam's razor, we assume that the exogenous variable is simple in the true causal direction. We quantify simplicity using Rényi entropy. Our main result is that, under natural assumptions, if the exogenous variable has low $H_0$ entropy (cardinality) in the true direction, it must have high $H_0$ entropy in the wrong direction. We establish several algorithmic hardness results about estimating the minimum-entropy exogenous variable. We show that the problem of finding the exogenous variable with minimum $H_1$ entropy (Shannon entropy) is equivalent to the problem of finding the minimum joint entropy given $n$ marginal distributions, also known as the minimum entropy coupling problem. We propose an efficient greedy algorithm for the minimum entropy coupling problem that, for $n=2$, provably finds a local optimum. This gives a greedy algorithm for finding the exogenous variable with minimum Shannon entropy. Our greedy entropy-based causal inference algorithm performs comparably to state-of-the-art additive noise models on real datasets. One advantage of our approach is that we make no use of the values of the random variables, only their distributions. Our method can therefore be used for causal inference on both ordinal and categorical data, unlike additive noise models.
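To make the greedy coupling step concrete, here is a small, hedged Python sketch of the "repeatedly match the largest remaining masses" heuristic the abstract describes for two marginals. The function names, tie-breaking, and example distributions are my own illustrative choices, not the authors' reference implementation.

```python
# Hedged sketch of the greedy minimum-entropy coupling heuristic described in
# the abstract, for two marginal distributions (n = 2). Names and tie-breaking
# are assumptions for illustration, not the authors' reference code.
import numpy as np

def greedy_min_entropy_coupling(p, q, tol=1e-12):
    """Return a joint distribution with marginals p and q, built greedily by
    repeatedly matching the largest remaining masses."""
    p = np.array(p, dtype=float).copy()
    q = np.array(q, dtype=float).copy()
    joint = np.zeros((len(p), len(q)))
    while p.sum() > tol and q.sum() > tol:
        i, j = int(np.argmax(p)), int(np.argmax(q))
        m = min(p[i], q[j])            # mass that can be matched in one cell
        joint[i, j] += m
        p[i] -= m
        q[j] -= m
    return joint

def shannon_entropy(dist):
    dist = np.asarray(dist, dtype=float).ravel()
    dist = dist[dist > 0]
    return float(-(dist * np.log2(dist)).sum())

# Example: couple two simple marginals and compare the joint entropy with the
# independent (outer-product) coupling, which typically has higher entropy.
p = [0.5, 0.3, 0.2]
q = [0.6, 0.4]
coupling = greedy_min_entropy_coupling(p, q)
print(coupling)                         # row/column sums recover p and q
print(shannon_entropy(coupling))        # entropy of the greedy joint
print(shannon_entropy(np.outer(p, q)))  # entropy of the independent coupling
```

The greedy choice zeroes out at least one remaining cell of a marginal per iteration, so the loop terminates after at most |p| + |q| steps while preserving both marginals by construction.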
Group Activity Selection on Social Networks
Igarashi, Ayumi (University of Oxford) | Peters, Dominik (University of Oxford) | Elkind, Edith (University of Oxford)
We propose a new variant of the group activity selection problem (GASP), where the agents are placed on a social network and activities can only be assigned to connected subgroups. We show that if multiple groups can simultaneously engage in the same activity, finding a stable outcome is easy as long as the network is acyclic. In contrast, if each activity can be assigned to a single group only, finding stable outcomes becomes computationally intractable, even if the underlying network is very simple: the problem of determining whether a given instance of a GASP admits a Nash stable outcome turns out to be NP-hard when the social network is a path, a star, or if the size of each connected component is bounded by a constant. On the other hand, we obtain fixed-parameter tractability results for this problem with respect to the number of activities.
Machine Learning
Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data: such algorithms overcome strictly static program instructions by making data-driven predictions or decisions, building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible; example applications include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines, and computer vision. Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory, and application domains to the field.
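As a concrete illustration of "building a model from sample inputs" rather than hand-coding rules, here is a minimal, hedged scikit-learn sketch of the spam-filtering example mentioned above. The toy emails and the choice of a naive Bayes classifier are assumptions made purely for illustration.

```python
# Minimal sketch of "learning from data" with scikit-learn: instead of writing
# explicit rules, we fit a model to labeled examples and let it predict on new
# inputs. The toy data below is invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",        # spam
    "limited offer click here",    # spam
    "meeting agenda for monday",   # not spam
    "project report attached",     # not spam
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # turn raw text into word-count features

model = MultinomialNB()
model.fit(X, labels)                   # learn the mapping from features to labels

new_email = ["free prize for the monday meeting"]
print(model.predict(vectorizer.transform(new_email)))
```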
An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples
Machine Learning (ML) is coming into its own, with a growing recognition that ML can play a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. ML provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization. The supply of able ML designers has yet to catch up to this demand. A major reason for this is that ML is just plain tricky. This tutorial introduces the basics of Machine Learning theory, laying down the common themes and concepts, making it easy to follow the logic and get comfortable with the topic. So what exactly is "machine learning" anyway?
Multi-step learning and underlying structure in statistical models
In multi-step learning, where a final learning task is accomplished via a sequence of intermediate learning tasks, the intuition is that successive steps or levels transform the initial data into representations more and more ``suited'' to the final learning task. A related principle arises in transfer learning, where Baxter (2000) proposed a theoretical framework to study how learning multiple tasks transforms the inductive bias of a learner. The most widespread multi-step learning approach is semi-supervised learning (SSL) with two steps: unsupervised, then supervised. Several authors (Castelli-Cover, 1996; Balcan-Blum, 2005; Niyogi, 2008; Ben-David et al, 2008; Urner et al, 2011) have analyzed SSL, with Balcan-Blum (2005) proposing a version of the PAC learning framework augmented by a ``compatibility function'' to link the concept class and the unlabeled data distribution. We propose to analyze SSL and other multi-step learning approaches, much in the spirit of Baxter's framework, by defining a learning problem generatively as a joint statistical model on $X \times Y$. This determines in a natural way the class of conditional distributions that are possible with each marginal, and amounts to an abstract form of compatibility function. It also allows us to analyze both discrete and non-discrete settings. As a tool for our analysis, we define a notion of $\gamma$-uniform shattering for statistical models. We use this to give conditions on the marginal and conditional models which imply an advantage for multi-step learning approaches. In particular, we recover a more general version of a result of Poggio et al (2012): under mild hypotheses, a multi-step approach which learns features invariant under successive factors of a finite group of invariances has sample complexity requirements that are additive rather than multiplicative in the size of the subgroups.
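The two-step SSL pattern described above can be made concrete with a small, hedged Python sketch: an unsupervised step learns a representation from unlabeled data, and a supervised step fits a classifier on a handful of labels in that representation. The synthetic data, the use of PCA, and all hyperparameters are illustrative assumptions, not part of the paper.

```python
# Hedged sketch of the two-step semi-supervised pattern discussed above:
# step 1 learns a representation from unlabeled data (here: PCA), step 2 fits a
# supervised model on a small labeled subset in that representation.
# Data, dimensions, and model choices are illustrative assumptions only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Unlabeled pool: high-dimensional points.
X_unlabeled = rng.normal(size=(1000, 50))
# Small labeled set; labels depend on the first coordinate (a stand-in signal),
# split at the median so both classes are present.
X_labeled = rng.normal(size=(30, 50))
y_labeled = (X_labeled[:, 0] > np.median(X_labeled[:, 0])).astype(int)

# Step 1 (unsupervised): learn a low-dimensional representation from unlabeled data.
pca = PCA(n_components=5).fit(X_unlabeled)

# Step 2 (supervised): train on the few labels, but in the learned representation.
clf = LogisticRegression().fit(pca.transform(X_labeled), y_labeled)

X_test = rng.normal(size=(5, 50))
print(clf.predict(pca.transform(X_test)))
```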
Supervised learning through the lens of compression
David, Ofir | Moran, Shay | Yehudayoff, Amir
This work continues the study of the relationship between sample compression schemes and statistical learning, which has mostly been investigated within the framework of binary classification. We first extend the investigation to multiclass categorization: we prove that in this case learnability is equivalent to compression of logarithmic sample size, and that the uniform convergence property implies compression of constant size. We use the compressibility-learnability equivalence to (i) show that, for multiclass categorization, PAC and agnostic PAC learnability are equivalent, and (ii) derive a compactness theorem for learnability. We then consider supervised learning under general loss functions: we show that in this case, in order to maintain the compressibility-learnability equivalence, it is necessary to consider an approximate variant of compression. We use it to show that PAC and agnostic PAC are not equivalent, even when the loss function has only three values.
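For readers new to sample compression, here is a hedged Python sketch of the classic textbook example (not taken from this paper): intervals on the real line admit a compression scheme of size two, since keeping the leftmost and rightmost positive examples suffices to reconstruct a consistent hypothesis.

```python
# Hedged sketch of a classic sample compression scheme (not from the paper):
# intervals on the real line compress to size 2, by keeping the leftmost and
# rightmost positive examples and reconstructing the interval they span.
def compress(sample):
    """sample: list of (x, y) with y in {0, 1}, assumed realizable by an interval.
    Returns at most two labeled points."""
    positives = [x for x, y in sample if y == 1]
    if not positives:
        return []                         # empty compression set: "all negative"
    return [(min(positives), 1), (max(positives), 1)]

def reconstruct(compressed):
    """Rebuild a hypothesis (a function on reals) from the compression set."""
    if not compressed:
        return lambda x: 0
    a = min(x for x, _ in compressed)
    b = max(x for x, _ in compressed)
    return lambda x: int(a <= x <= b)

# The reconstructed hypothesis labels the whole original sample correctly.
sample = [(-3.0, 0), (-1.0, 1), (0.5, 1), (2.0, 1), (4.0, 0)]
h = reconstruct(compress(sample))
assert all(h(x) == y for x, y in sample)
print([h(x) for x, y in sample])
```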
On the Recursive Teaching Dimension of VC Classes
Chen, Xi | Cheng, Yu | Tang, Bo
The recursive teaching dimension (RTD) of a concept class $C \subseteq \{0, 1\}^n$, introduced by Zilles et al. [ZLHZ11], is a complexity parameter measured by the worst-case number of labeled examples needed to learn any target concept of $C$ in the recursive teaching model. In this paper, we study the quantitative relation between RTD and the well-known learning complexity measure VC dimension (VCD), and improve the best known upper and (worst-case) lower bounds on the recursive teaching dimension with respect to the VC dimension. Given a concept class $C \subseteq \{0, 1\}^n$ with $VCD(C) = d$, we first show that $RTD(C)$ is at most $d 2^{d+1}$. This is the first upper bound for $RTD(C)$ that depends only on $VCD(C)$, independent of the size of the concept class $|C|$ and its domain size $n$. Before our work, the best known upper bound for $RTD(C)$ was $O(d 2^d \log \log |C|)$, obtained by Moran et al. [MSWY15]; we remove the $\log \log |C|$ factor. We also improve the lower bound on the worst-case ratio of $RTD(C)$ to $VCD(C)$. We present a family of classes $\{ C_k \}_{k \ge 1}$ with $VCD(C_k) = 3k$ and $RTD(C_k) = 5k$, which implies that the ratio of $RTD(C)$ to $VCD(C)$ in the worst case can be as large as $5/3$. The largest ratio previously known was $3/2$, obtained by Kuhlmann [Kuh99]; since then, no finite concept class $C$ had been known to satisfy $RTD(C) > (3/2) VCD(C)$.
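Both quantities compared in this abstract can be computed by brute force on toy classes, which may help make the definitions concrete. The sketch below is a hedged, exponential-time illustration only; the helper names and the small example class are my own, not code from the paper.

```python
# Hedged brute-force sketch of the two quantities compared above, for tiny
# concept classes C in {0,1}^n represented as tuples of bits. This is exponential
# and only meant to make the definitions of VCD and RTD concrete.
from itertools import combinations

def vc_dimension(concepts, n):
    """Largest d such that some d coordinates are shattered by the class."""
    best = 0
    for d in range(1, n + 1):
        for coords in combinations(range(n), d):
            patterns = {tuple(c[i] for i in coords) for c in concepts}
            if len(patterns) == 2 ** d:
                best = d
                break
        else:
            return best          # no shattered set of size d exists
    return best

def teaching_dim(target, concepts, n):
    """Smallest number of coordinates whose labels single out `target` in `concepts`."""
    others = [c for c in concepts if c != target]
    for d in range(0, n + 1):
        for coords in combinations(range(n), d):
            if all(any(c[i] != target[i] for i in coords) for c in others):
                return d
    return n

def recursive_teaching_dim(concepts, n):
    """RTD: repeatedly remove the concepts that are easiest to teach and track
    the largest teaching dimension needed along the way."""
    remaining, rtd = list(concepts), 0
    while remaining:
        dims = {c: teaching_dim(c, remaining, n) for c in remaining}
        round_min = min(dims.values())
        rtd = max(rtd, round_min)
        remaining = [c for c in remaining if dims[c] > round_min]
    return rtd

# Example: the empty concept plus the three singletons over a 3-point domain.
C = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(vc_dimension(C, 3), recursive_teaching_dim(C, 3))   # VCD = 1, RTD = 1
```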