Shalev-Shwartz, Shai
Subspace Learning with Partial Information
Gonen, Alon, Rosenbaum, Dan, Eldar, Yonina, Shalev-Shwartz, Shai
The goal of subspace learning is to find a $k$-dimensional subspace of $\mathbb{R}^d$, such that the expected squared distance between instance vectors and the subspace is as small as possible. In this paper we study subspace learning in a partial information setting, in which the learner can only observe $r \le d$ attributes from each instance vector. We propose several efficient algorithms for this task, and analyze their sample complexity
Learning Sparse Low-Threshold Linear Classifiers
Sabato, Sivan, Shalev-Shwartz, Shai, Srebro, Nathan, Hsu, Daniel, Zhang, Tong
We consider the problem of learning a non-negative linear classifier with a $1$-norm of at most $k$, and a fixed threshold, under the hinge-loss. This problem generalizes the problem of learning a $k$-monotone disjunction. We prove that we can learn efficiently in this setting, at a rate which is linear in both $k$ and the size of the threshold, and that this is the best possible rate. We provide an efficient online learning algorithm that achieves the optimal rate, and show that in the batch case, empirical risk minimization achieves this rate as well. The rates we show are tighter than the uniform convergence rate, which grows with $k^2$.
Beyond Convexity: Stochastic Quasi-Convex Optimization
Hazan, Elad, Levy, Kfir, Shalev-Shwartz, Shai
This poster has been moved from Monday #86 to Thursday #101. Stochastic convex optimization is a basic and well studied primitive in machine learning. It is well known that convex and Lipschitz functions can be minimized efficiently using Stochastic Gradient Descent (SGD).The Normalized Gradient Descent (NGD) algorithm, is an adaptation of Gradient Descent, which updates according to the direction of the gradients, rather than the gradients themselves. In this paper we analyze a stochastic version of NGD and prove its convergence to a global minimum for a wider class of functions: we require the functions to be quasi-convex and locally-Lipschitz. Quasi-convexity broadens the concept of unimodality to multidimensions and allows for certain types of saddle points, which are a known hurdle for first-order optimization methods such as gradient descent. Locally-Lipschitz functions are only required to be Lipschitz in a small region around the optimum. This assumption circumvents gradient explosion, which is another known hurdle for gradient descent variants. Interestingly, unlike the vanilla SGD algorithm, the stochastic normalized gradient descent algorithm provably requires a minimal minibatch size.
On the Computational Efficiency of Training Neural Networks
Livni, Roi, Shalev-Shwartz, Shai, Shamir, Ohad
It is well-known that neural networks are computationally hard to train. On the other hand, in practice, modern day neural networks are trained efficiently using SGDand a variety of tricks that include different activation functions (e.g. ReLU), over-specification (i.e., train networks which are larger than needed), and regularization. In this paper we revisit the computational complexity of training neural networks from a modern perspective. We provide both positive and negative results,some of them yield new provably efficient and practical algorithms for training certain types of neural networks.
On the Computational Efficiency of Training Neural Networks
Livni, Roi, Shalev-Shwartz, Shai, Shamir, Ohad
It is well-known that neural networks are computationally hard to train. On the other hand, in practice, modern day neural networks are trained efficiently using SGD and a variety of tricks that include different activation functions (e.g. ReLU), over-specification (i.e., train networks which are larger than needed), and regularization. In this paper we revisit the computational complexity of training neural networks from a modern perspective. We provide both positive and negative results, some of them yield new provably efficient and practical algorithms for training certain types of neural networks.
An Algorithm for Training Polynomial Networks
Livni, Roi, Shalev-Shwartz, Shai, Shamir, Ohad
We consider deep neural networks, in which the output of each node is a quadratic function of its inputs. Similar to other deep architectures, these networks can compactly represent any function on a finite training set. The main goal of this paper is the derivation of an efficient layer-by-layer algorithm for training such networks, which we denote as the \emph{Basis Learner}. The algorithm is a universal learner in the sense that the training error is guaranteed to decrease at every iteration, and can eventually reach zero under mild conditions. We present practical implementations of this algorithm, as well as preliminary experimental results. We also compare our deep architecture to other shallow architectures for learning polynomials, in particular kernel learning.
Accelerated Mini-Batch Stochastic Dual Coordinate Ascent
Shalev-Shwartz, Shai, Zhang, Tong
Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007].
More data speeds up training time in learning halfspaces over sparse vectors
Daniely, Amit, Linial, Nati, Shalev-Shwartz, Shai
The increased availability of data in recent years led several authors to ask whether it is possible to use data as a {\em computational} resource. That is, if more data is available, beyond the sample complexity limit, is it possible to use the extra examples to speed up the computation time required to perform the learning task? We give the first positive answer to this question for a {\em natural supervised learning problem} --- we consider agnostic PAC learning of halfspaces over $3$-sparse vectors in $\{-1,1,0\}^n$. This class is inefficiently learnable using $O\left(n/\epsilon^2\right)$ examples. Our main contribution is a novel, non-cryptographic, methodology for establishing computational-statistical gaps, which allows us to show that, under a widely believed assumption that refuting random $\mathrm{3CNF}$ formulas is hard, efficiently learning this class using $O\left(n/\epsilon^2\right)$ examples is impossible. We further show that under stronger hardness assumptions, even $O\left(n^{1.499}/\epsilon^2\right)$ examples do not suffice. On the other hand, we show a new algorithm that learns this class efficiently using $\tilde{\Omega}\left(n^2/\epsilon^2\right)$ examples. This formally establishes the tradeoff between sample and computational complexity for a natural supervised learning problem.
Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization
Shalev-Shwartz, Shai, Zhang, Tong
We introduce a proximal version of the stochastic dual coordinate ascent method and show how to accelerate the method using an inner-outer iteration procedure. We analyze the runtime of the framework and obtain rates that improve state-of-the-art results for various key machine learning optimization problems including SVM, logistic regression, ridge regression, Lasso, and multiclass SVM. Experiments validate our theoretical findings.
Accelerated Mini-Batch Stochastic Dual Coordinate Ascent
Shalev-Shwartz, Shai, Zhang, Tong
Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of \cite{nesterov2007gradient}.