Computational Learning Theory
Finite-sample risk bounds for maximum likelihood estimation with arbitrary penalties
Brinda, W. D., Klusowski, Jason M.
Remarkably general method for bounding the statistical risk of penalized likelihood estimators comes from work on two-part coding, one of the minimum description length (MDL) approaches to statistical inference. Two-part coding MDL prescribes assigning codelengths to a model (or model class) then selecting the distribution that provides the most efficient description of one's data [1]. The total description length has two parts: the part that specifies a distribution within the model (as well as a model within the model class if necessary) and the part that specifies the data with reference to the specified distribution. If the codelengths are exactly Kraft-valid, this approach is equivalent to Bayesian maximum a posteriori (MAP) estimation, in that the two parts correspond to log reciprocal of prior and log reciprocal of likelihood respectively. More generally, one can call the part of the codelength specifying the distribution a penalty term; it is called the complexity in MDL literature. Let (Θ, L) denote a discrete set indexing distributions along with a complexity function. With X P, the (pointwise) redundancy of any θ Θ is its two-part codelength minus log(1/p(X)), the codelength one gets by using P as the coding distribution.
Fast Rates for General Unbounded Loss Functions: from ERM to Generalized Bayes
Grünwald, Peter D., Mehta, Nishant A.
We present new excess risk bounds for general unbounded loss functions including log loss and squared loss, where the distribution of the losses may be heavy-tailed. The bounds hold for general estimators, but they are optimized when applied to $\eta$-generalized Bayesian, MDL, and ERM estimators. When applied with log loss, the bounds imply convergence rates for generalized Bayesian inference under misspecification in terms of a generalization of the Hellinger metric as long as the learning rate $\eta$ is set correctly. For general loss functions, our bounds rely on two separate conditions: the $v$-GRIP (generalized reversed information projection) conditions, which control the lower tail of the excess loss; and the newly introduced witness condition, which controls the upper tail. The parameter $v$ in the $v$-GRIP conditions determines the achievable rate and is akin to the exponent in the well-known Tsybakov margin condition and the Bernstein condition for bounded losses, which the $v$-GRIP conditions generalize; favorable $v$ in combination with small model complexity leads to $\tilde{O}(1/n)$ rates. The witness condition allows us to connect the excess risk to an 'annealed' version thereof, by which we generalize several previous results connecting Hellinger and R\'enyi divergence to KL divergence.
The Vapnik-Chervonenkis dimension of cubes in $\mathbb{R}^d$
The Vapnik-Chervonenkis (VC) dimension of a collection of subsets of a set is an important combinatorial concept in settings such as discrete geometry and machine learning. In this paper we prove that the VC dimension of the family of $d$-dimensional cubes in $\mathbb R^d$ is $\lfloor(3d+1)/2\rfloor$.
Kullback-Leibler Principal Component for Tensors is not NP-hard
Huang, Kejun, Sidiropoulos, Nicholas D.
We study the problem of nonnegative rank-one approximation of a nonnegative tensor, and show that the globally optimal solution that minimizes the generalized Kullback-Leibler divergence can be efficiently obtained, i.e., it is not NP-hard. This result works for arbitrary nonnegative tensors with an arbitrary number of modes (including two, i.e., matrices). We derive a closed-form expression for the KL principal component, which is easy to compute and has an intuitive probabilistic interpretation. For generalized KL approximation with higher ranks, the problem is for the first time shown to be equivalent to multinomial latent variable modeling, and an iterative algorithm is derived that resembles the expectation-maximization algorithm. On the Iris dataset, we showcase how the derived results help us learn the model in an \emph{unsupervised} manner, and obtain strikingly close performance to that from supervised methods.
Rate-Distortion Bounds on Bayes Risk in Supervised Learning
Nokleby, Matthew, Beirami, Ahmad, Calderbank, Robert
We present an information-theoretic framework for bounding the number of labeled samples needed to train a classifier in a parametric Bayesian setting. We derive bounds on the average $L_p$ distance between the learned classifier and the true maximum a posteriori classifier, which are well-established surrogates for the excess classification error due to imperfect learning. We provide lower and upper bounds on the rate-distortion function, using $L_p$ loss as the distortion measure, of a maximum a priori classifier in terms of the differential entropy of the posterior distribution and a quantity called the interpolation dimension, which characterizes the complexity of the parametric distribution family. In addition to expressing the information content of a classifier in terms of lossy compression, the rate-distortion function also expresses the minimum number of bits a learning machine needs to extract from training data to learn a classifier to within a specified $L_p$ tolerance. We use results from universal source coding to express the information content in the training data in terms of the Fisher information of the parametric family and the number of training samples available. The result is a framework for computing lower bounds on the Bayes $L_p$ risk. This framework complements the well-known probably approximately correct (PAC) framework, which provides minimax risk bounds involving the Vapnik-Chervonenkis dimension or Rademacher complexity. Whereas the PAC framework provides upper bounds the risk for the worst-case data distribution, the proposed rate-distortion framework lower bounds the risk averaged over the data distribution. We evaluate the bounds for a variety of data models, including categorical, multinomial, and Gaussian models. In each case the bounds are provably tight orderwise, and in two cases we prove that the bounds are tight up to multiplicative constants.
[D] What would you include in a first ML course? • r/MachineLearning
Once you have some basic algorithms/examples to work with, you can use them to explain the main issues of over/underfitting. Based on the level of the people, this can be used for discussing learning theory: what ERM is and why SRM might be better. From there you could easily cover the VC-Dimension and dive into SVMs.
An efficient quantum algorithm for generative machine learning
Gao, Xun, Zhang, Zhengyu, Duan, Luming
Duan 1,2 1 Center for Quantum Information, IIIS, Tsinghua University, Beijing 100084, PR China 2 Department of Physics, University of Michigan, Ann Arbor, Michigan 48109, USA A central task in the field of quantum computing is to find applications where quantum computer could provide exponential speedup over any classical computer [1-3]. Machine learning represents an important field with broad applications where quantum computer may offer significant speedup [4-8]. Several quantum algorithms for discriminative machine learning [9] have been found based on efficient solving of linear algebraic problems [10-15], with potential exponential speedup in runtime under the assumption of effective input from a quantum random access memory [16]. In machine learning, generative models represent another large class [9] which is widely used for both supervised and unsupervised learning [17, 18]. Here, we propose an efficient quantum algorithm for machine learning based on a quantum generative model. We prove that our proposed model is exponentially more powerful to represent probability distributions compared with classical generative models and has exponential speedup in training and inference at least for some instances under a reasonable assumption in computational complexity theory. Our result opens a new direction for quantum machine learning and offers a remarkable example in which a quantum algorithm shows exponential improvement over any classical algorithm in an important application field. Machine learning and artificial intelligence represent a very important application area which could be revolutionized by quantum computers with clever algorithms that offer exponential speedup [4, 5]. The candidate algorithms with potential exponential speedup so far rely on efficient quantum solution of linear system of equations or linear algebraic problems [12-15]. Those algorithms require quantum random access memory (QRAM) as a critical component in addition to a quantum computer. In a QRAM, the number of required quantum routers scales up exponentially with the number of qubits in those algorithms [16, 19]. This exponential overhead in resource requirement poses a significant challenge for its experimental implementation and is a caveat for fair comparison with corresponding classical algorithms [5, 20]. In this paper, we propose a quantum algorithm with potential exponential speedup for machine learning basedFigure 1: Classical and quantum generative models. A factor graph is a bipartite graph where one group of the vertices represent variables (denoted by circles) and the other group of vertices represent positive functions (denoted by squares) acting on connected variables. The corresponding probability distribution is given by the product of all these functions. Each variable connects to at most a constant number of functions which introduce correlations in the probability distribution.b,
Approximation Algorithms for $\ell_0$-Low Rank Approximation
Bringmann, Karl, Kolev, Pavel, Woodruff, David P.
We study the $\ell_0$-Low Rank Approximation Problem, where the goal is, given an $m \times n$ matrix $A$, to output a rank-$k$ matrix $A'$ for which $\|A'-A\|_0$ is minimized. Here, for a matrix $B$, $\|B\|_0$ denotes the number of its non-zero entries. This NP-hard variant of low rank approximation is natural for problems with no underlying metric, and its goal is to minimize the number of disagreeing data positions. We provide approximation algorithms which significantly improve the running time and approximation factor of previous work. For $k > 1$, we show how to find, in poly$(mn)$ time for every $k$, a rank $O(k \log(n/k))$ matrix $A'$ for which $\|A'-A\|_0 \leq O(k^2 \log(n/k)) \mathrm{OPT}$. To the best of our knowledge, this is the first algorithm with provable guarantees for the $\ell_0$-Low Rank Approximation Problem for $k > 1$, even for bicriteria algorithms. For the well-studied case when $k = 1$, we give a $(2+\epsilon)$-approximation in {\it sublinear time}, which is impossible for other variants of low rank approximation such as for the Frobenius norm. We strengthen this for the well-studied case of binary matrices to obtain a $(1+O(\psi))$-approximation in sublinear time, where $\psi = \mathrm{OPT}/\lVert A\rVert_0$. For small $\psi$, our approximation factor is $1+o(1)$.
An Approach to One-Bit Compressed Sensing Based on Probably Approximately Correct Learning Theory
Ahsen, Mehmet Eren, Vidyasagar, Mathukumalli
In this paper, the problem of one-bit compressed sensing (OBCS) is formulated as a problem in probably approximately correct (PAC) learning. It is shown that the Vapnik-Chervonenkis (VC-) dimension of the set of half-spaces in $\mathbb{R}^n$ generated by $k$-sparse vectors is bounded below by $k \lg (n/k)$ and above by $2k \lg (n/k)$, plus some round-off terms. By coupling this estimate with well-established results in PAC learning theory, we show that a consistent algorithm can recover a $k$-sparse vector with $O(k \lg (n/k))$ measurements, given only the signs of the measurement vector. This result holds for \textit{all} probability measures on $\mathbb{R}^n$. It is further shown that random sign-flipping errors result only in an increase in the constant in the $O(k \lg (n/k))$ estimate. Because constructing a consistent algorithm is not straight-forward, we present a heuristic based on the $\ell_1$-norm support vector machine, and illustrate that its computational performance is superior to a currently popular method.
A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity
Grünwald, Peter D., Mehta, Nishant A.
We present a novel notion of complexity that interpolates between and generalizes some classic existing complexity notions in learning theory: for estimators like empirical risk minimization (ERM) with arbitrary bounded losses, it is upper bounded in terms of data-independent Rademacher complexity; for generalized Bayesian estimators, it is upper bounded by the data-dependent information complexity (also known as stochastic or PAC-Bayesian, $\mathrm{KL}(\text{posterior} \operatorname{\|} \text{prior})$ complexity. For (penalized) ERM, the new complexity reduces to (generalized) normalized maximum likelihood (NML) complexity, i.e. a minimax log-loss individual-sequence regret. Our first main result bounds excess risk in terms of the new complexity. Our second main result links the new complexity via Rademacher complexity to $L_2(P)$ entropy, thereby generalizing earlier results of Opper, Haussler, Lugosi, and Cesa-Bianchi who did the log-loss case with $L_\infty$. Together, these results recover optimal bounds for VC- and large (polynomial entropy) classes, replacing localized Rademacher complexity by a simpler analysis which almost completely separates the two aspects that determine the achievable rates: 'easiness' (Bernstein) conditions and model complexity.