Goto

Collaborating Authors

 Gradient Descent


The Power of Normalization: Faster Evasion of Saddle Points

arXiv.org Machine Learning

A commonly used heuristic in non-convex optimization is Normalized Gradient Descent (NGD) - a variant of gradient descent in which only the direction of the gradient is taken into account and its magnitude ignored. We analyze this heuristic and show that with carefully chosen parameters and noise injection, this method can provably evade saddle points. We establish the convergence of NGD to a local minimum, and demonstrate rates which improve upon the fastest known first order algorithm due to Ge e al. (2015). The effectiveness of our method is demonstrated via an application to the problem of online tensor decomposition; a task for which saddle point evasion is known to result in convergence to global minima.


Which is your favorite Machine Learning Algorithm?

#artificialintelligence

Developed back in the 50s by Rosenblatt and colleagues, this extremely simple algorithm can be viewed as the foundation for some of the most successful classifiers today, including suport vector machines and logistic regression, solved using stochastic gradient descent. The convergence proof for the Perceptron algorithm is one of the most elegant pieces of math I've seen in ML. Most useful: Boosting, especially boosted decision trees. This intuitive approach allows you to build highly accurate ML models, by combining many simple ones. Boosting is one of the most practical methods in ML, it's widely used in industry, can handle a wide variety of data types, and can be implemented at scale.



Benchmarking Quantum Hardware for Training of Fully Visible Boltzmann Machines

arXiv.org Machine Learning

Quantum annealing (QA) is a hardware-based heuristic optimization and sampling method applicable to discrete undirected graphical models. While similar to simulated annealing, QA relies on quantum, rather than thermal, effects to explore complex search spaces. For many classes of problems, QA is known to offer computational advantages over simulated annealing. Here we report on the ability of recent QA hardware to accelerate training of fully visible Boltzmann machines. We characterize the sampling distribution of QA hardware, and show that in many cases, the quantum distributions differ significantly from classical Boltzmann distributions. In spite of this difference, training (which seeks to match data and model statistics) using standard classical gradient updates is still effective. We investigate the use of QA for seeding Markov chains as an alternative to contrastive divergence (CD) and persistent contrastive divergence (PCD). Using $k=50$ Gibbs steps, we show that for problems with high-energy barriers between modes, QA-based seeds can improve upon chains with CD and PCD initializations. For these hard problems, QA gradient estimates are more accurate, and allow for faster learning. Furthermore, and interestingly, even the case of raw QA samples (that is, $k=0$) achieved similar improvements. We argue that this relates to the fact that we are training a quantum rather than classical Boltzmann distribution in this case. The learned parameters give rise to hardware QA distributions closely approximating classical Boltzmann distributions that are hard to train with CD/PCD.


Practical Secure Aggregation for Federated Learning on User-Held Data

arXiv.org Machine Learning

Secure Aggregation protocols allow a collection of mutually distrust parties, each holding a private value, to collaboratively compute the sum of those values without revealing the values themselves. We consider training a deep neural network in the Federated Learning model, using distributed stochastic gradient descent across user-held training data on mobile devices, wherein Secure Aggregation protects each user's model gradient. We design a novel, communication-efficient Secure Aggregation protocol for high-dimensional data that tolerates up to 1/3 users failing to complete the protocol. For 16-bit input values, our protocol offers 1.73x communication expansion for $2^{10}$ users and $2^{20}$-dimensional vectors, and 1.98x expansion for $2^{14}$ users and $2^{24}$ dimensional vectors.


An overview of gradient descent optimization algorithms

@machinelearnbot

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This blog post aims at providing you with intuitions towards the behaviour of different algorithms for optimizing gradient descent that will help you put them to use. We are first going to look at the different variants of gradient descent.


IAB Reveals Winners of Data Rockstar Awards

#artificialintelligence

IAB (Interactive Advertising Bureau) and its Data Center of Excellence today announced the winners of the inaugural IAB Data Rockstar Awards, celebrating top industry leaders and practitioners who have demonstrated achievement in data science or technology. The top finalists were selected by the IAB Data Center of Excellence Board of Directors and were evaluated based on demonstrated excellence, creativity or forward-thinking approaches to solving problems in data science, as well as the impact their contributions have made to their company or industry. Chalasani developed a highly efficient, distributed, extreme-scale, single-pass online logistic regression learning system in Scala/Spark, using variants of Stochastic Gradient Descent, capable of handling hundreds of millions of sparse features and billions of training observations. His system incorporates a number of state-of-the-art techniques that do not exist together in any other machine learning system, including adaptive feature-scaling, adaptive gradients, feature-interactions and feature-hashing. Chalasani work is central to MediaMath's vision for every addressable interaction between a marketer and a consumer to be driven by Machine Learning optimization against all available, relevant data at that moment, to maximize long-term marketer business outcomes.


tdeboissiere/DeepLearningImplementations

@machinelearnbot

This is a keras implementation of Improving Stochastic Gradient Descent With Feedback. Check this page for the authors' original implementation of Eve. Or copy the Eve class to keras/optimizers.py and use it as any other optimizer.


Recursive Decomposition for Nonconvex Optimization

arXiv.org Machine Learning

Continuous optimization is an important problem in many areas of AI, including vision, robotics, probabilistic inference, and machine learning. Unfortunately, most real-world optimization problems are nonconvex, causing standard convex techniques to find only local optima, even with extensions like random restarts and simulated annealing. We observe that, in many cases, the local modes of the objective function have combinatorial structure, and thus ideas from combinatorial optimization can be brought to bear. Based on this, we propose a problem-decomposition approach to nonconvex optimization. Similarly to DPLL-style SAT solvers and recursive conditioning in probabilistic inference, our algorithm, RDIS, recursively sets variables so as to simplify and decompose the objective function into approximately independent sub-functions, until the remaining functions are simple enough to be optimized by standard techniques like gradient descent. The variables to set are chosen by graph partitioning, ensuring decomposition whenever possible. We show analytically that RDIS can solve a broad class of nonconvex optimization problems exponentially faster than gradient descent with random restarts. Experimentally, RDIS outperforms standard techniques on problems like structure from motion and protein folding.


SGD with Variance Reduction beyond Empirical Risk Minimization

arXiv.org Machine Learning

We introduce a doubly stochastic proximal gradient algorithm for optimizing a finite average of smooth convex functions, whose gradients depend on numerically expensive expectations. Our main motivation is the acceleration of the optimization of the regularized Cox partial-likelihood (the core model used in survival analysis), but our algorithm can be used in different settings as well. The proposed algorithm is doubly stochastic in the sense that gradient steps are done using stochastic gradient descent (SGD) with variance reduction, where the inner expectations are approximated by a Monte-Carlo Markov-Chain (MCMC) algorithm. We derive conditions on the MCMC number of iterations guaranteeing convergence, and obtain a linear rate of convergence under strong convexity and a sublinear rate without this assumption. We illustrate the fact that our algorithm improves the state-of-the-art solver for regularized Cox partial-likelihood on several datasets from survival analysis.