 Gradient Descent


Mini-Batch Spectral Clustering

arXiv.org Machine Learning

The cost of computing the spectrum of Laplacian matrices hinders the application of spectral clustering to large data sets. While approximations recover computational tractability, they can potentially affect clustering performance. This paper proposes a practical approach to learn spectral clustering based on adaptive stochastic gradient optimization. Crucially, the proposed approach recovers the exact spectrum of Laplacian matrices in the limit of the iterations, and the cost of each iteration is linear in the number of samples. Extensive experimental validation on data sets with up to half a million samples demonstrates its scalability and its ability to outperform state-of-the-art approximate methods to learn spectral clustering for a given computational budget.


Warm Starting Bayesian Optimization

arXiv.org Machine Learning

We develop a framework for warm-starting Bayesian optimization that reduces the solution time required to solve an optimization problem that is one in a sequence of related problems. This is useful when optimizing the output of a stochastic simulator that fails to provide derivative information, for which Bayesian optimization methods are well-suited. Sequences of related optimization problems arise when making several business decisions using one optimization model and input data collected over different time periods or markets. While many gradient-based methods can be warm started by initiating optimization at the solution to the previous problem, this warm start approach does not apply to Bayesian optimization methods, which carry a full metamodel of the objective function from iteration to iteration. Our approach builds a joint statistical model of the entire collection of related objective functions, and uses a value of information calculation to recommend points to evaluate.


Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions

arXiv.org Machine Learning

Several recent works have explored stochastic gradient methods for variational inference that exploit the geometry of the variational-parameter space. However, the theoretical properties of these methods are not well understood and these methods typically only apply to conditionally-conjugate models. We present a new stochastic method for variational inference which exploits the geometry of the variational-parameter space and also yields simple closed-form updates even for non-conjugate models. We also give a convergence-rate analysis of our method and many other previous methods which exploit the geometry of the space. Our analysis generalizes existing convergence results for stochastic mirror-descent on non-convex objectives by using a more general class of divergence functions. Beyond giving a theoretical justification for a variety of recent methods, our experiments show that new algorithms derived in this framework lead to state-of-the-art results on a variety of problems. Further, due to its generality, we expect that our theoretical analysis could also apply to other applications.


Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization

arXiv.org Machine Learning

Stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods are Bayesian analogs to popular stochastic optimization methods; however, this connection is not well studied. We explore this relationship by applying simulated annealing to an SG-MCMC algorithm. Furthermore, we extend recent SG-MCMC methods with two key components: i) adaptive preconditioners (as in AdaGrad or RMSprop), and ii) adaptive element-wise momentum weights. The zero-temperature limit gives a novel stochastic optimization method with adaptive element-wise momentum weights, while conventional optimization methods only have a shared, static momentum weight. Under certain assumptions, our theoretical analysis suggests the proposed simulated annealing approach converges close to the global optima. Experiments on several deep neural network models show state-of-the-art results compared to related stochastic optimization algorithms.


Neural Programmer: Inducing Latent Programs with Gradient Descent

arXiv.org Machine Learning

Deep neural networks have achieved impressive supervised classification performance in many tasks including image recognition, speech recognition, and sequence to sequence learning. However, this success has not been translated to applications like question answering that may involve complex arithmetic and logic reasoning. A major limitation of these models is in their inability to learn even simple arithmetic and logic operations. For example, it has been shown that neural networks fail to learn to add two binary numbers reliably. In this work, we propose Neural Programmer, an end-to-end differentiable neural network augmented with a small set of basic arithmetic and logic operations. Neural Programmer can call these augmented operations over several steps, thereby inducing compositional programs that are more complex than the built-in operations. The model learns from a weak supervision signal which is the result of execution of the correct program, hence it does not require expensive annotation of the correct program itself. The decisions of what operations to call, and what data segments to apply to are inferred by Neural Programmer. Such decisions, during training, are done in a differentiable fashion so that the entire network can be trained jointly by gradient descent. We find that training the model is difficult, but it can be greatly improved by adding random noise to the gradient. On a fairly complex synthetic table-comprehension dataset, traditional recurrent networks and attentional models perform poorly while Neural Programmer typically obtains nearly perfect accuracy.
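The idea of adding random noise to the gradient can be illustrated on a toy problem. The following is a minimal sketch only, not the Neural Programmer training procedure: the function name `noisy_gradient_descent`, the quadratic objective, and the decaying noise schedule with parameters `eta` and `gamma` are all assumptions chosen for illustration, since the abstract does not specify how the noise is generated.

```python
import random

def noisy_gradient_descent(grad, x0, lr=0.1, steps=200,
                           eta=0.01, gamma=0.55, seed=0):
    """Gradient descent where Gaussian noise is added to every gradient.

    The noise variance eta / (1 + t) ** gamma shrinks as the step
    counter t grows (an assumed schedule, for illustration only).
    """
    rng = random.Random(seed)
    x = x0
    for t in range(steps):
        sigma = (eta / (1 + t) ** gamma) ** 0.5
        noisy_grad = grad(x) + rng.gauss(0.0, sigma)
        x -= lr * noisy_grad
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = noisy_gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_final)  # close to the minimizer x = 3
```

Because the noise decays over time, the iterate still settles near the minimizer of this convex toy objective; on non-convex objectives the early, larger noise is what may help the search leave poor regions.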


Kernel Risk-Sensitive Loss: Definition, Properties and Application to Robust Adaptive Filtering

arXiv.org Machine Learning

Nonlinear similarity measures defined in kernel space, such as correntropy, can extract higher-order statistics of data and offer potentially significant performance improvement over their linear counterparts especially in non-Gaussian signal processing and machine learning. In this work, we propose a new similarity measure in kernel space, called the kernel risk-sensitive loss (KRSL), and provide some important properties. We apply the KRSL to adaptive filtering and investigate the robustness, and then develop the MKRSL algorithm and analyze the mean square convergence performance. Compared with correntropy, the KRSL can offer a more efficient performance surface, thereby enabling a gradient based method to achieve faster convergence speed and higher accuracy while still maintaining the robustness to outliers. Theoretical analysis results and superior performance of the new algorithm are confirmed by simulation.


Information-theoretical label embeddings for large-scale image classification

arXiv.org Machine Learning

We consider the problem of predicting to which classes an image belongs, where the number of classes is large (many thousands or tens of thousands) and where each image typically belongs to multiple classes that should all be properly identified: multi-label, massively multi-class classification. In such classification problems, the best practice until now (for instance in use at Google, Inc.) has been to use a deep convolutional neural network such as the ones described in [19] or [18], culminating in a logistic regression layer with a sigmoid cross-entropy loss, with target labels encoded as high-dimensional sparse binary vectors. The use of logistic regression implies an important yet oft overlooked assumption made about the label space: the classes are considered to be statistically independent, each class being treated as an independent dimension in the label space. This is generally not the case in practice: mirroring statistical dependencies found in the real world, label spaces often have a well-defined internal structure, with some labels being more likely to co-occur than other labels. For instance, "sky" and "beach" are frequently co-occurring labels, while "crane" and "manta ray" rarely co-occur. The sigmoid cross-entropy loss with sparse binary targets does not allow such observations about the structure of the label space to be leveraged. There is therefore an opportunity to exploit the internal structure of the label space for gains in training speed, precision, and recall. One simple way to achieve this is to project the labels onto a lower-dimensional manifold (an embedding space) where a distance function between embedded labels would capture useful statistical dependencies. An appropriate loss function may then allow a parametric model trained via stochastic gradient descent to benefit from the structure of the manifold during training and inference.
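The independence assumption behind the sigmoid cross-entropy baseline is visible in the loss itself: it is a plain sum of per-class terms, with no term that couples two labels. The sketch below assumes a hypothetical helper `sigmoid_cross_entropy` and made-up logits; it shows the baseline loss only, not the label-embedding method the abstract proposes.

```python
import math

def sigmoid_cross_entropy(logits, targets):
    """Sum of independent per-class binary cross-entropy terms.

    Uses the numerically stable form
        max(z, 0) - z * y + log(1 + exp(-|z|))
    for a logit z and a binary target y.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total

# Sparse binary target over four hypothetical classes: only class 0 is present.
targets = [1.0, 0.0, 0.0, 0.0]
logits = [2.0, -1.0, -3.0, 0.5]

loss = sigmoid_cross_entropy(logits, targets)
per_class = [sigmoid_cross_entropy([z], [y]) for z, y in zip(logits, targets)]
print(loss, per_class)
```

The total equals the sum of the per-class values, so changing the logit of one class never affects the penalty of another; this is why co-occurrence structure such as "sky" with "beach" cannot be expressed by this loss.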


Genetic algorithms and symbolic regression

#artificialintelligence

A few months ago, I wrote a post about optimization using gradient descent, which involves searching for a model that best meets certain criteria by repeatedly making adjustments that improve things a little bit at a time. In many situations, this works quite well and will always or almost always find the best solution. But in other cases, it's possible for this approach to fall into a locally optimal solution that isn't the overall best, but is better than any nearby solution. A common way to deal with this sort of situation is to add some randomness into the algorithm, making it possible to jump out of one of these locally optimal solutions into a slightly worse solution that is adjacent to a much better one. In this post, I want to explore one such approach, called a genetic algorithm (or an evolutionary algorithm), which I'll illustrate with a specific type of genetic algorithm called symbolic regression.


An overview of gradient descent optimization algorithms

#artificialintelligence

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent. These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This blog post aims at providing you with intuitions towards the behaviour of different algorithms for optimizing gradient descent that will help you put them to use. We are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training. Subsequently, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules. We will also take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting.
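The "different variants of gradient descent" usually refers to how many samples are used for each parameter update. As a rough sketch under assumed names and data (the function `minibatch_gd`, the toy line y = 2x + 1, and the hyperparameters are all illustrative, not taken from the post), a single routine covers batch, mini-batch, and stochastic gradient descent by varying `batch_size`:

```python
import random

def minibatch_gd(xs, ys, batch_size, lr=0.1, epochs=2000, seed=0):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Pair each prediction error with its input value.
            errs = [(w * xs[i] + b - ys[i], xs[i]) for i in batch]
            grad_w = sum(2.0 * e * x for e, x in errs) / len(batch)
            grad_b = sum(2.0 * e for e, _ in errs) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free toy data on the line y = 2x + 1.
xs = [i / 10 for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]

batch_fit = minibatch_gd(xs, ys, batch_size=len(xs))  # batch: one update per epoch
mini_fit = minibatch_gd(xs, ys, batch_size=2)         # mini-batch
sgd_fit = minibatch_gd(xs, ys, batch_size=1)          # stochastic: one sample per update
print(batch_fit, mini_fit, sgd_fit)
```

Because the toy data is noise-free, all three settings approach the same line. With noisy data, smaller batches give cheaper but noisier updates, which is the trade-off between the variants.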


Global Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation

arXiv.org Machine Learning

It has been observed in a variety of contexts that gradient descent methods have great success in solving low-rank matrix factorization problems, despite the relevant problem formulation being non-convex. We tackle a particular instance of this scenario, where we seek the $d$-dimensional subspace spanned by a streaming data matrix. We apply the natural first order incremental gradient descent method, constraining the gradient method to the Grassmannian. In this paper, we propose an adaptive step size scheme that is greedy for the noiseless case, that maximizes the improvement of our metric of convergence at each data index $t$, and yields an expected improvement for the noisy case. We show that, with noise-free data, this method converges from any random initialization to the global minimum of the problem. For noisy data, we provide the expected convergence rate of the proposed algorithm per iteration.