Computational Learning Theory
Machine Learning Engineer posted by Technica Corporation on DigitalMediaJobsNetwork.com
MS or PhD in Mathematics, Physics, or Computer Science with a specialization in data analysis, machine learning, or data science strongly preferred. Ability to apply a broad range of algorithms to varied data science problems. Expertise in machine learning theory and practice, with a solid understanding of machine learning algorithms. Experience in the following: applying regression, classification, and clustering algorithms to varied types of data; supervised and unsupervised learning; use of data science languages such as R and Python; building and testing predictive models; training models on very large training data sets. Working knowledge of various text mining algorithms and their use cases (e.g., keyword extraction, PLSA, LDA, HMM, CRF, deep learning and recurrent ANN, word2vec/doc2vec, and Bayesian modeling). Strong understanding of text pre-processing and normalization techniques, such as tokenization, POS tagging, and parsing, and how they work at a low level. Strong understanding of testing and tuning models.
Fast Rates for General Unbounded Loss Functions: from ERM to Generalized Bayes
Grünwald, Peter D., Mehta, Nishant A.
We present new excess risk bounds for general unbounded loss functions including log loss and squared loss, where the distribution of the losses may be heavy-tailed. The bounds hold for general estimators, but they are optimized when applied to $\eta$-generalized Bayesian, MDL, and ERM estimators. When applied with log loss, the bounds imply convergence rates for generalized Bayesian inference under misspecification in terms of a generalization of the Hellinger metric as long as the learning rate $\eta$ is set correctly. For general loss functions, our bounds rely on two separate conditions: the $v$-GRIP (generalized reversed information projection) conditions, which control the lower tail of the excess loss; and the newly introduced witness condition, which controls the upper tail. The parameter $v$ in the $v$-GRIP conditions determines the achievable rate and is akin to the exponent in the well-known Tsybakov margin condition and the Bernstein condition for bounded losses, which the $v$-GRIP conditions generalize; favorable $v$ in combination with small model complexity leads to $\tilde{O}(1/n)$ rates. The witness condition allows us to connect the excess risk to an 'annealed' version thereof, by which we generalize several previous results connecting Hellinger and Rényi divergence to KL divergence.
The Vapnik-Chervonenkis dimension of cubes in $\mathbb{R}^d$
The Vapnik-Chervonenkis (VC) dimension of a collection of subsets of a set is an important combinatorial concept in settings such as discrete geometry and machine learning. In this paper we prove that the VC dimension of the family of $d$-dimensional cubes in $\mathbb R^d$ is $\lfloor(3d+1)/2\rfloor$.
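As a small illustration of the stated formula, the sketch below tabulates $\lfloor(3d+1)/2\rfloor$ and checks the one-dimensional case by brute force, where a cube is simply an interval. The function names are hypothetical and the check covers only $d = 1$; it is not a verification of the paper's result in general.

```python
from itertools import combinations


def vc_dim_cubes(d: int) -> int:
    """Value of floor((3d + 1) / 2), the VC dimension stated in the abstract."""
    return (3 * d + 1) // 2


def intervals_shatter(points) -> bool:
    """Check whether closed intervals on the real line shatter distinct points.

    An interval picks out a (possibly empty) contiguous run of the sorted
    points, so the set is shattered only if every subset is such a run.
    """
    n = len(sorted(points))
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            # Indices are sorted, so the subset is contiguous exactly when
            # its index span equals its size.
            if subset[-1] - subset[0] + 1 != len(subset):
                return False
    return True


if __name__ == "__main__":
    for d in range(1, 6):
        print(d, vc_dim_cubes(d))
    print(intervals_shatter([0.0, 1.0]))       # two points
    print(intervals_shatter([0.0, 1.0, 2.0]))  # three points
```

For $d = 1$ the formula gives 2, which agrees with the brute-force check: two points are shattered by intervals, while three are not because the outer pair cannot be selected without the middle point.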
Kullback-Leibler Principal Component for Tensors is not NP-hard
Huang, Kejun, Sidiropoulos, Nicholas D.
We study the problem of nonnegative rank-one approximation of a nonnegative tensor, and show that the globally optimal solution that minimizes the generalized Kullback-Leibler divergence can be efficiently obtained, i.e., it is not NP-hard. This result works for arbitrary nonnegative tensors with an arbitrary number of modes (including two, i.e., matrices). We derive a closed-form expression for the KL principal component, which is easy to compute and has an intuitive probabilistic interpretation. For generalized KL approximation with higher ranks, the problem is for the first time shown to be equivalent to multinomial latent variable modeling, and an iterative algorithm is derived that resembles the expectation-maximization algorithm. On the Iris dataset, we showcase how the derived results help us learn the model in an \emph{unsupervised} manner, and obtain strikingly close performance to that from supervised methods.
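The closed form is not spelled out in the abstract. The sketch below assumes it is the product of the mode-wise marginals of the tensor, rescaled by the total mass, which is the standard minimizer of generalized KL divergence over rank-one nonnegative tensors (the independence model). The function names `kl_rank_one` and `gen_kl` are hypothetical.

```python
import numpy as np


def gen_kl(X, Y):
    """Generalized KL divergence: sum(X log(X/Y)) - sum(X) + sum(Y)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    mask = X > 0  # terms with X == 0 contribute 0 to the log part
    return float(np.sum(X[mask] * np.log(X[mask] / Y[mask])) - X.sum() + Y.sum())


def kl_rank_one(X):
    """Rank-one approximation built from mode marginals (assumed closed form).

    Each factor is the marginal of X along one mode divided by the total
    mass; the outer product of the factors is scaled by the total mass.
    """
    X = np.asarray(X, dtype=float)
    total = X.sum()
    result = np.full(X.shape, total)
    for axis in range(X.ndim):
        other_axes = tuple(i for i in range(X.ndim) if i != axis)
        marginal = X.sum(axis=other_axes) / total
        shape = [1] * X.ndim
        shape[axis] = X.shape[axis]
        result = result * marginal.reshape(shape)
    return result


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((4, 5)) + 0.1
    best = kl_rank_one(X)
    other = np.outer(rng.random(4) + 0.1, rng.random(5) + 0.1)
    print(gen_kl(X, best), gen_kl(X, other))
```

Under this reading, the factors are normalized marginals, so the approximation can be interpreted as the best independent (product) distribution for the normalized tensor, scaled back to the original total.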
Rate-Distortion Bounds on Bayes Risk in Supervised Learning
Nokleby, Matthew, Beirami, Ahmad, Calderbank, Robert
We present an information-theoretic framework for bounding the number of labeled samples needed to train a classifier in a parametric Bayesian setting. We derive bounds on the average $L_p$ distance between the learned classifier and the true maximum a posteriori classifier, which is a well-established surrogate for the excess classification error due to imperfect learning. We provide lower and upper bounds on the rate-distortion function, using $L_p$ loss as the distortion measure, of a maximum a posteriori classifier in terms of the differential entropy of the posterior distribution and a quantity called the interpolation dimension, which characterizes the complexity of the parametric distribution family. In addition to expressing the information content of a classifier in terms of lossy compression, the rate-distortion function also expresses the minimum number of bits a learning machine needs to extract from training data to learn a classifier to within a specified $L_p$ tolerance. We use results from universal source coding to express the information content in the training data in terms of the Fisher information of the parametric family and the number of training samples available. The result is a framework for computing lower bounds on the Bayes $L_p$ risk. This framework complements the well-known probably approximately correct (PAC) framework, which provides minimax risk bounds involving the Vapnik-Chervonenkis dimension or Rademacher complexity. Whereas the PAC framework provides upper bounds on the risk for the worst-case data distribution, the proposed rate-distortion framework provides lower bounds on the risk averaged over the data distribution. We evaluate the bounds for a variety of data models, including categorical, multinomial, and Gaussian models. In each case the bounds are provably tight orderwise, and in two cases we prove that the bounds are tight up to multiplicative constants.
An efficient quantum algorithm for generative machine learning
Gao, Xun, Zhang, Zhengyu, Duan, Luming
A central task in the field of quantum computing is to find applications where a quantum computer could provide exponential speedup over any classical computer. Machine learning represents an important field with broad applications where quantum computers may offer significant speedup. Several quantum algorithms for discriminative machine learning have been found based on efficiently solving linear algebraic problems, with potential exponential speedup in runtime under the assumption of effective input from a quantum random access memory. In machine learning, generative models represent another large class which is widely used for both supervised and unsupervised learning. Here, we propose an efficient quantum algorithm for machine learning based on a quantum generative model. We prove that our proposed model is exponentially more powerful at representing probability distributions than classical generative models and has exponential speedup in training and inference, at least for some instances, under a reasonable assumption in computational complexity theory. Our result opens a new direction for quantum machine learning and offers a remarkable example in which a quantum algorithm shows exponential improvement over any classical algorithm in an important application field.
Sampling and Reconstruction of Graph Signals via Weak Submodularity and Semidefinite Relaxation
Hashemi, Abolfazl, Shafipour, Rasoul, Vikalo, Haris, Mateos, Gonzalo
We study the problem of sampling a bandlimited graph signal in the presence of noise, where the objective is to select a node subset of prescribed cardinality that minimizes the signal reconstruction mean squared error (MSE). To that end, we formulate the task at hand as the minimization of MSE subject to binary constraints, and approximate the resulting NP-hard problem via semidefinite programming (SDP) relaxation. Moreover, we provide an alternative formulation based on maximizing a monotone weak submodular function and propose a randomized-greedy algorithm to find a sub-optimal subset. We then derive a worst-case performance guarantee on the MSE returned by the randomized greedy algorithm for general non-stationary graph signals. The efficacy of the proposed methods is illustrated through numerical simulations on synthetic and real-world graphs. Notably, the randomized greedy algorithm yields an order-of-magnitude speedup over state-of-the-art greedy sampling schemes, while incurring only a marginal MSE performance loss.
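The randomized greedy idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the signal is taken to be $x = Uc$ with an identity prior covariance on $c$, each sampled node contributes one noisy linear observation with variance `noise_var`, and the MSE is the trace of the resulting posterior covariance. The names `mse`, `randomized_greedy`, and `sample_size` are hypothetical.

```python
import numpy as np


def mse(U, selected, noise_var=0.1):
    """Trace of the posterior covariance of c given samples at `selected`.

    Assumes a bandlimited signal x = U c, an identity prior on c, and
    independent observation noise with variance `noise_var`.
    """
    k = U.shape[1]
    Us = U[list(selected), :]
    information = np.eye(k) + Us.T @ Us / noise_var
    return float(np.trace(np.linalg.inv(information)))


def randomized_greedy(U, budget, sample_size, noise_var=0.1, seed=0):
    """Select `budget` nodes; each step scores only a random candidate subset."""
    rng = np.random.default_rng(seed)
    remaining = list(range(U.shape[0]))
    selected = []
    for _ in range(budget):
        m = min(sample_size, len(remaining))
        candidates = rng.choice(remaining, size=m, replace=False)
        # Keep the candidate whose addition gives the lowest MSE.
        best = min(candidates, key=lambda j: mse(U, selected + [int(j)], noise_var))
        selected.append(int(best))
        remaining.remove(int(best))
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    U, _ = np.linalg.qr(rng.standard_normal((30, 4)))  # stand-in frequency basis
    chosen = randomized_greedy(U, budget=6, sample_size=10)
    print(chosen, mse(U, chosen), mse(U, []))
```

Scoring only `sample_size` candidates per step, rather than every remaining node, is what reduces the cost relative to plain greedy selection; setting `sample_size` to the number of nodes recovers the ordinary greedy scheme in this sketch.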
Approximation Algorithms for $\ell_0$-Low Rank Approximation
Bringmann, Karl, Kolev, Pavel, Woodruff, David P.
We study the $\ell_0$-Low Rank Approximation Problem, where the goal is, given an $m \times n$ matrix $A$, to output a rank-$k$ matrix $A'$ for which $\|A'-A\|_0$ is minimized. Here, for a matrix $B$, $\|B\|_0$ denotes the number of its non-zero entries. This NP-hard variant of low rank approximation is natural for problems with no underlying metric, and its goal is to minimize the number of disagreeing data positions. We provide approximation algorithms which significantly improve the running time and approximation factor of previous work. For $k > 1$, we show how to find, in poly$(mn)$ time for every $k$, a rank $O(k \log(n/k))$ matrix $A'$ for which $\|A'-A\|_0 \leq O(k^2 \log(n/k)) \mathrm{OPT}$. To the best of our knowledge, this is the first algorithm with provable guarantees for the $\ell_0$-Low Rank Approximation Problem for $k > 1$, even for bicriteria algorithms. For the well-studied case when $k = 1$, we give a $(2+\epsilon)$-approximation in {\it sublinear time}, which is impossible for other variants of low rank approximation such as for the Frobenius norm. We strengthen this for the well-studied case of binary matrices to obtain a $(1+O(\psi))$-approximation in sublinear time, where $\psi = \mathrm{OPT}/\lVert A\rVert_0$. For small $\psi$, our approximation factor is $1+o(1)$.
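To make the objective concrete, the sketch below evaluates $\|A'-A\|_0$ for a small binary matrix and one hand-picked rank-one candidate. It only illustrates the error measure; it does not implement any of the approximation algorithms, and the helper name `l0_error` and the example matrix are hypothetical.

```python
import numpy as np


def l0_error(A, A_approx):
    """Number of positions where the two matrices disagree, i.e. ||A' - A||_0."""
    return int(np.count_nonzero(np.asarray(A) != np.asarray(A_approx)))


if __name__ == "__main__":
    A = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])
    # A hand-picked rank-one candidate u v^T; not claimed to come from
    # any particular algorithm.
    u = np.array([1, 1, 0])
    v = np.array([1, 1, 0])
    candidate = np.outer(u, v)
    error = l0_error(A, candidate)
    print(error, error / np.count_nonzero(A))
```

The printed ratio is the candidate's error divided by $\lVert A\rVert_0$; since the candidate's error is at least $\mathrm{OPT}$, this ratio is an upper bound on $\psi$ for this example.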
An Approach to One-Bit Compressed Sensing Based on Probably Approximately Correct Learning Theory
Ahsen, Mehmet Eren, Vidyasagar, Mathukumalli
In this paper, the problem of one-bit compressed sensing (OBCS) is formulated as a problem in probably approximately correct (PAC) learning. It is shown that the Vapnik-Chervonenkis (VC-) dimension of the set of half-spaces in $\mathbb{R}^n$ generated by $k$-sparse vectors is bounded below by $k \lg (n/k)$ and above by $2k \lg (n/k)$, plus some round-off terms. By coupling this estimate with well-established results in PAC learning theory, we show that a consistent algorithm can recover a $k$-sparse vector with $O(k \lg (n/k))$ measurements, given only the signs of the measurement vector. This result holds for \textit{all} probability measures on $\mathbb{R}^n$. It is further shown that random sign-flipping errors result only in an increase in the constant in the $O(k \lg (n/k))$ estimate. Because constructing a consistent algorithm is not straightforward, we present a heuristic based on the $\ell_1$-norm support vector machine, and illustrate that its computational performance is superior to that of a currently popular method.
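The measurement model and the leading terms of the VC-dimension bounds can be sketched as follows. The round-off terms mentioned in the abstract are omitted, Gaussian measurement vectors are an assumption made only for the demonstration, and all function names are hypothetical.

```python
import math

import numpy as np


def vc_bounds(n, k):
    """Leading terms k lg(n/k) and 2k lg(n/k); round-off terms are omitted."""
    core = k * math.log2(n / k)
    return core, 2 * core


def one_bit_measurements(A, x):
    """Keep only the sign of each linear measurement <a_i, x>."""
    return np.sign(A @ x)


def is_consistent(A, y, x_hat):
    """True if x_hat reproduces every observed sign."""
    return bool(np.all(np.sign(A @ x_hat) == y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k, m = 64, 4, 200
    x = np.zeros(n)
    x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
    A = rng.standard_normal((m, n))  # assumed Gaussian measurement vectors
    y = one_bit_measurements(A, x)
    print(vc_bounds(n, k))
    print(is_consistent(A, y, x), is_consistent(A, y, 3.0 * x))
```

Because only signs are observed, any positive rescaling of $x$ is equally consistent with the data, so recovery in this setting can only be meaningful up to scale.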
A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity
Grünwald, Peter D., Mehta, Nishant A.
We present a novel notion of complexity that interpolates between and generalizes some classic existing complexity notions in learning theory: for estimators like empirical risk minimization (ERM) with arbitrary bounded losses, it is upper bounded in terms of data-independent Rademacher complexity; for generalized Bayesian estimators, it is upper bounded by the data-dependent information complexity (also known as stochastic or PAC-Bayesian, $\mathrm{KL}(\text{posterior} \,\|\, \text{prior})$ complexity). For (penalized) ERM, the new complexity reduces to (generalized) normalized maximum likelihood (NML) complexity, i.e. a minimax log-loss individual-sequence regret. Our first main result bounds excess risk in terms of the new complexity. Our second main result links the new complexity via Rademacher complexity to $L_2(P)$ entropy, thereby generalizing earlier results of Opper, Haussler, Lugosi, and Cesa-Bianchi, who treated the log-loss case with $L_\infty$. Together, these results recover optimal bounds for VC- and large (polynomial entropy) classes, replacing localized Rademacher complexity by a simpler analysis which almost completely separates the two aspects that determine the achievable rates: 'easiness' (Bernstein) conditions and model complexity.