Gradient Descent
Adaptive Online Gradient Descent
Hazan, Elad, Rakhlin, Alexander, Bartlett, Peter L.
We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between T and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.
Efficient multiple hyperparameter learning for log-linear models
Foo, Chuan-sheng, Do, Chuong B., Ng, Andrew Y.
Using multiple regularization hyperparameters is an effective method for managing model complexity in problems where input features have varying amounts of noise. While algorithms for choosing multiple hyperparameters are often used in neural networks and support vector machines, they are not common in structured prediction tasks, such as sequence labeling or parsing. In this paper, we consider the problem of learning regularization hyperparameters for log-linear models, a class of probabilistic models for structured prediction tasks which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning provides a significant boost in accuracy compared to models learned using only a single regularization hyperparameter.
Incremental Natural Actor-Critic Algorithms
Bhatnagar, Shalabh, Ghavamzadeh, Mohammad, Lee, Mark, Sutton, Richard S.
We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learningmethods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility withfunction approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variancein some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learningin the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
Fast Variational Inference for Large-scale Internet Diagnosis
Kiciman, Emre, Maltz, David, Platt, John C.
Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 10^4 possible faults from 10^5 observations. Further, such inference must be performed in less than a second. Inference can be done at this speed by combining a variational approximation, a mean-field approximation, and the use of stochastic gradient descent to optimize a variational cost function. We use this fast inference to diagnose a time series of anomalous HTTP requests taken from a real web service. The inference is fast enough to analyze network logs with billions of entries in a matter of hours.
Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
Lecchini-visintini, Andrea, Lygeros, John, Maciejowski, Jan
Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimizationwhere the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing whichallows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory.
Topmoumoute Online Natural Gradient Algorithm
Roux, Nicolas L., Manzagol, Pierre-antoine, Bengio, Yoshua
Guided by the goal of obtaining an optimization algorithm that is both fast and yielding good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.
Adaptive Online Gradient Descent
Hazan, Elad, Rakhlin, Alexander, Bartlett, Peter L.
We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between T and log T . Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.
Simulated annealing for weighted polygon packing
Xu, Yi-Chun, Xiao, Ren-Bin, Amos, Martyn
In this paper we present a new algorithm for a layout optimization problem: this concerns the placement of weighted polygons inside a circular container, the two objectives being to minimize imbalance of mass and to minimize the radius of the container. This problem carries real practical significance in industrial applications (such as the design of satellites), as well as being of significant theoretical interest. Previous work has dealt with circular or rectangular objects, but here we deal with the more realistic case where objects may be represented as polygons and the polygons are allowed to rotate. We present a solution based on simulated annealing and first test it on instances with known optima. Our results show that the algorithm obtains container radii that are close to optimal. We also compare our method with existing algorithms for the (special) rectangular case. Experimental results show that our approach out-performs these methods in terms of solution quality.
Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms
Semi-supervised learning algorithms have been successfully applied in many applications with scarce labeled data, by utilizing the unlabeled data. One important category is graph based semi-supervised learning algorithms, for which the performance depends considerably on the quality of the graph, or its hyperparameters. In this paper, we deal with the less explored problem of learning the graphs. We propose a graph learning method for the harmonic energy minimization method; this is done by minimizing the leave-one-out prediction error on labeled data points. We use a gradient based method and designed an efficient algorithm which significantly accelerates the calculation of the gradient by applying the matrix inversion lemma and using careful pre-computation. Experimental results show that the graph learning method is effective in improving the performance of the classification algorithm.
Graph Laplacian Regularization for Large-Scale Semidefinite Programming
Weinberger, Kilian Q., Sha, Fei, Zhu, Qihui, Saul, Lawrence K.
In many areas of science and engineering, the problem arises how to discover low dimensional representations of high dimensional data. Recently, a number of researchers have converged on common solutions to this problem using methods from convex optimization. In particular, many results have been obtained by constructing semidefinite programs (SDPs) with low rank solutions. While the rank of matrix variables in SDPs cannot be directly constrained, it has been observed that low rank solutions emerge naturally by computing high variance or maximal trace solutions that respect local distance constraints. In this paper, we show how to solve very large problems of this type by a matrix factorization that leads to much smaller SDPs than those previously studied. The matrix factorization is derived by expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. The smaller SDPs obtained from this matrix factorization yield very good approximations to solutions of the original problem. Moreover, these approximations can be further refined by conjugate gradient descent. We illustrate the approach on localization in large scale sensor networks, where optimizations involving tens of thousands of nodes can be solved in just a few minutes.