Computational Learning Theory
Textbook on the *theory* of neural nets/ML algorithms?
Foundations of Machine Learning, by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, is a 2012 book on machine learning theory. Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David, is a similar 2014 book that's fairly well-known and targeted a little more introductory than Mohri/Rostamizadeh/Talwalkar, but still has lots of theory in it. Neural Network Learning: Theoretical Foundations, by Martin Anthony and Peter Bartlett, is a 1999 book about ML theory phrased as being about neural networks, but (to my impression not having read it) is mostly about ML theory in general. These three books mostly take the predominant viewpoint of statistical learning theory. There is also an interesting point of view called computational learning theory, inspired more by computer science theory.
An improvement of the convergence proof of the ADAM-Optimizer
Bock, Sebastian, Goppold, Josef, Weiร, Martin
A common way to train neural networks is the Backpropagation. This algorithm includes a gradient descent method, which needs an adaptive step size. In the area of neural networks, the ADAM-Optimizer is one of the most popular adaptive step size methods. It was invented in \cite{Kingma.2015} by Kingma and Ba. The $5865$ citations in only three years shows additionally the importance of the given paper. We discovered that the given convergence proof of the optimizer contains some mistakes, so that the proof will be wrong. In this paper we give an improvement to the convergence proof of the ADAM-Optimizer.
High-dimensional Penalty Selection via Minimum Description Length Principle
Miyaguchi, Kohei, Yamanishi, Kenji
We tackle the problem of penalty selection of regularization on the basis of the minimum description length (MDL) principle. In particular, we consider that the design space of the penalty function is high-dimensional. In this situation, the luckiness-normalized-maximum-likelihood(LNML)-minimization approach is favorable, because LNML quantifies the goodness of regularized models with any forms of penalty functions in view of the minimum description length principle, and guides us to a good penalty function through the high-dimensional space. However, the minimization of LNML entails two major challenges: 1) the computation of the normalizing factor of LNML and 2) its minimization in high-dimensional spaces. In this paper, we present a novel regularization selection method (MDL-RS), in which a tight upper bound of LNML (uLNML) is minimized with local convergence guarantee. Our main contribution is the derivation of uLNML, which is a uniform-gap upper bound of LNML in an analytic expression. This solves the above challenges in an approximate manner because it allows us to accurately approximate LNML and then efficiently minimize it. The experimental results show that MDL-RS improves the generalization performance of regularized estimates specifically when the model has redundant parameters.
Search versus Decision: The Opacity of Backbones and Backdoors Under a Weak Assumption
Hemaspaandra, Lane A., Narvรกez, David E.
Backdoors and backbones of Boolean formulas are hidden structural properties. A natural goal, already in part realized, is that solver algorithms seek to obtain substantially better performance by exploiting these structures. However, the present paper is not intended to improve the performance of SAT solvers, but rather is a cautionary paper. In particular, the theme of this paper is that there is a potential chasm between the existence of such structures in the Boolean formula and being able to effectively exploit them. This does not mean that these structures are not useful to solvers. It does mean that one must be very careful not to assume that it is computationally easy to go from the existence of a structure to being able to get one's hands on it and/or being able to exploit the structure. For example, in this paper we show that, under the assumption that P $\neq$ NP, there are easily recognizable families of Boolean formulas with strong backdoors that are easy to find, yet for which it is hard (in fact, NP-complete) to determine whether the formulas are satisfiable. We also show that, also under the assumption P $\neq$ NP, there are easily recognizable sets of Boolean formulas for which it is hard (in fact, NP-complete) to determine whether they have a large backbone.
Online Improper Learning with an Approximation Oracle
Hazan, Elad, Hu, Wei, Li, Yuanzhi, Li, Zhiyuan
One of the most fundamental and well-studied questions in learning theory is whether one can learn a given problem using an optimization oracle. For online learning in games, it was shown by Kalai and Vempala (2005) that an optimization oracle giving the best decision in hindsight is sufficient for attaining optimal regret. However, in many non-convex settings, such an optimization oracle is either unavailable or NPhard to compute. In contrast, in many such cases, efficient approximation algorithms are usually known, and are guaranteed to return a solution within a certain multiplicative factor of the optimum.
Finite Biased Teaching with Infinite Concept Classes
Hernandez-Orallo, Jose, Telle, Jan Arne
We investigate the teaching of infinite concept classes through the effect of the learning bias (which is used by the learner to prefer some concepts over others and by the teacher to devise the teaching examples) and the sampling bias (which determines how the concepts are sampled from the class). We analyse two important classes: Turing machines and finite-state machines. We derive bounds for the biased teaching dimension when the learning bias is derived from a complexity measure (Kolmogorov complexity and minimal number of states respectively) and analyse the sampling distributions that lead to finite expected biased teaching dimensions. We highlight the existing trade-off between the bound and the representativeness of the sample, and its implications for the understanding of what teaching rich concepts to machines entails.
VC-Dimension Based Generalization Bounds for Relational Learning
Kuzelka, Ondrej, Wang, Yuyi, Schockaert, Steven
In one of the most common settings in statistical relational learning (SRL), we are given a single relational structure from which we need to learn a model, and this model is then used to make predictions on previously unseen structures. For example, the global relational structure could correspond to a large social network, with the training data specifying the relationships that hold among a small subset of the users, along with their attributes. Clearly, in order to provide any guarantees on the accuracy of these predictions, we need to make (simplifying) assumptions about how the training and test structures are related. In this paper, we follow the setting from [10, 9], where it is assumed that these structures, which we will call relational examples, are all obtained by sampling domain elements from a larger global structure (uniformly and without replacement).
Five Most Popular Open Source Frameworks Used in Machine Learning Analytics Insight
Machine language a branch of artificial intelligence which enables system the ability to learn from data without being programmed. Machine learning got evolved from pattern recognition and computational learning theory in artificial intelligence. It has revolutionized the conventional way through developing algorithms that can learn and make predictions on data. There are innumerable factors that have improved the contribution of machine learning. Open source frameworks are one of the major reasons for the boost in machine learning. A framework is a collection of programs, libraries and languages evolved to use in application development.
A Direct Sum Result for the Information Complexity of Learning
Nachum, Ido, Shafer, Jonathan, Yehudayoff, Amir
How many bits of information are required to PAC learn a class of hypotheses of VC dimension $d$? The mathematical setting we follow is that of Bassily et al. (2018), where the value of interest is the mutual information $\mathrm{I}(S;A(S))$ between the input sample $S$ and the hypothesis outputted by the learning algorithm $A$. We introduce a class of functions of VC dimension $d$ over the domain $\mathcal{X}$ with information complexity at least $\Omega\left(d\log \log \frac{|\mathcal{X}|}{d}\right)$ bits for any consistent and proper algorithm (deterministic or random). Bassily et al. proved a similar (but quantitatively weaker) result for the case $d=1$. The above result is in fact a special case of a more general phenomenon we explore. We define the notion of information complexity of a given class of functions $\mathcal{H}$. Intuitively, it is the minimum amount of information that an algorithm for $\mathcal{H}$ must retain about its input to ensure consistency and properness. We prove a direct sum result for information complexity in this context; roughly speaking, the information complexity sums when combining several classes.