Goto

Collaborating Authors

 Gradient Descent


Diving into the shallows: a computational perspective on large-scale shallow learning

arXiv.org Machine Learning

In this paper we first identify a basic limitation in gradient descent-based optimization methods when used in conjunctions with smooth kernels. An analysis based on the spectral properties of the kernel demonstrates that only a vanishingly small portion of the function space is reachable after a polynomial number of gradient descent iterations. This lack of approximating power drastically limits gradient descent for a fixed computational budget leading to serious over-regularization/underfitting. The issue is purely algorithmic, persisting even in the limit of infinite data. To address this shortcoming in practice, we introduce EigenPro iteration, based on a preconditioning scheme using a small number of approximately computed eigenvectors. It can also be viewed as learning a new kernel optimized for gradient descent. It turns out that injecting this small (computationally inexpensive and SGD-compatible) amount of approximate second-order information leads to major improvements in convergence. For large data, this translates into significant performance boost over the standard kernel methods. In particular, we are able to consistently match or improve the state-of-the-art results recently reported in the literature with a small fraction of their computational budget. Finally, we feel that these results show a need for a broader computational perspective on modern large-scale learning to complement more traditional statistical and convergence analyses. In particular, many phenomena of large-scale high-dimensional inference are best understood in terms of optimization on infinite dimensional Hilbert spaces, where standard algorithms can sometimes have properties at odds with finite-dimensional intuition. A systematic analysis concentrating on the approximation power of such algorithms within a budget of computation may lead to progress both in theory and practice.


Second-Order Kernel Online Convex Optimization with Adaptive Sketching

arXiv.org Machine Learning

Kernel online convex optimization (KOCO) is a framework combining the expressiveness of non-parametric kernel models with the regret guarantees of online learning. First-order KOCO methods such as functional gradient descent require only $\mathcal{O}(t)$ time and space per iteration, and, when the only information on the losses is their convexity, achieve a minimax optimal $\mathcal{O}(\sqrt{T})$ regret. Nonetheless, many common losses in kernel problems, such as squared loss, logistic loss, and squared hinge loss posses stronger curvature that can be exploited. In this case, second-order KOCO methods achieve $\mathcal{O}(\log(\text{Det}(\boldsymbol{K})))$ regret, which we show scales as $\mathcal{O}(d_{\text{eff}}\log T)$, where $d_{\text{eff}}$ is the effective dimension of the problem and is usually much smaller than $\mathcal{O}(\sqrt{T})$. The main drawback of second-order methods is their much higher $\mathcal{O}(t^2)$ space and time complexity. In this paper, we introduce kernel online Newton step (KONS), a new second-order KOCO method that also achieves $\mathcal{O}(d_{\text{eff}}\log T)$ regret. To address the computational complexity of second-order methods, we introduce a new matrix sketching algorithm for the kernel matrix $\boldsymbol{K}_t$, and show that for a chosen parameter $\gamma \leq 1$ our Sketched-KONS reduces the space and time complexity by a factor of $\gamma^2$ to $\mathcal{O}(t^2\gamma^2)$ space and time per iteration, while incurring only $1/\gamma$ times more regret.


Variable Annealing Length and Parallelism in Simulated Annealing

AAAI Conferences

In this paper, we propose: (a) a restart schedule for an adaptive simulated annealer, and (b) parallel simulated annealing, with an adaptive and parameter-free annealing schedule. The foundation of our approach is the Modified Lam annealing schedule, which adaptively controls the temperature parameter to track a theoretically ideal rate of acceptance of neighboring states. A sequential implementation of Modified Lam simulated annealing is almost parameter-free. However, it requires prior knowledge of the annealing length. We eliminate this parameter using restarts, with an exponentially increasing schedule of annealing lengths. We then extend this restart schedule to parallel implementation, executing several Modified Lam simulated annealers in parallel, with varying initial annealing lengths, and our proposed parallel annealing length schedule. To validate our approach, we conduct experiments on an NP-Hard scheduling problem with sequence-dependent setup constraints. We compare our approach to fixed length restarts, both sequentially and in parallel. Our results show that our approach can achieve substantial performance gains, throughout the course of the run, demonstrating our approach to be an effective anytime algorithm.


Provable Alternating Gradient Descent for Non-negative Matrix Factorization with Strong Correlations

arXiv.org Machine Learning

Non-negative matrix factorization is a basic tool for decomposing data into the feature and weight matrices under non-negativity constraints, and in practice is often solved in the alternating minimization framework. However, it is unclear whether such algorithms can recover the ground-truth feature matrix when the weights for different features are highly correlated, which is common in applications. This paper proposes a simple and natural alternating gradient descent based algorithm, and shows that with a mild initialization it provably recovers the ground-truth in the presence of strong correlations. In most interesting cases, the correlation can be in the same order as the highest possible. Our analysis also reveals its several favorable features including robustness to noise. We complement our theoretical results with empirical studies on semi-synthetic datasets, demonstrating its advantage over several popular methods in recovering the ground-truth.


Homotopy Analysis for Tensor PCA

arXiv.org Machine Learning

Non-convex optimization is a critical component in modern machine learning. Unfortunately, theoretical guarantees for nonconvex optimization have been mostly negative, and the problems are computationally hard in the worst case. Nevertheless, simple local-search algorithms such as stochastic gradient descent have enjoyed great empirical success in areas such as deep learning. As such, recent research efforts have attempted to bridge this gap between theory and practice. For example, one property that can guarantee the success of local search methods over nonconvex functions is when all local minima are also the global minima. Interestingly, it has been recently proven that many well known nonconvex problems do have this property, under mild conditions. Consequently, local-search methods, which are designed to find a local optimum, automatically achieve global optimality. Examples of such problems include matrix completion [1], orthogonal tensor decomposition [2, 3], phase retrieval [4], complete dictionary learning [5], and so on. However, such a class of nonconvex problems is limited, and there are many practical problems with poor local optima, where local search methods can fail.


Learning to Learn without Gradient Descent by Gradient Descent

arXiv.org Machine Learning

We learn recurrent neural network optimizers trained on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer in that they can be used to efficiently optimize a broad range of derivative-free black-box functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks and hyper-parameter tuning tasks. Up to the training horizon, the learned optimizers learn to trade-off exploration and exploitation, and compare favourably with heavily engineered Bayesian optimization packages for hyper-parameter tuning.


A Well-Tempered Landscape for Non-convex Robust Subspace Recovery

arXiv.org Machine Learning

We present a mathematical analysis of a non-convex energy landscape for Robust Subspace Recovery. We prove that an underlying subspace is the only stationary point and local minimizer in a large neighborhood if a generic condition holds for a dataset. We further show that if the generic condition is satisfied, a geodesic gradient descent method over the Grassmannian manifold can exactly recover the underlying subspace with proper initialization. The condition is shown to hold with high probability for a certain model of data.


Types of Optimization Algorithms used in Neural Networks and Ways to Optimize Gradient Descent

#artificialintelligence

Have you ever wondered which optimization algorithm to use for your Neural network Model to produce slightly better and faster results by updating the Model parameters such as Weights and Bias values . Should we use Gradient Descent or Stochastic gradient Descent or Adam? I too didn't know about the major differences between these different types of Optimization Strategies and which one is better over another before writing this article. Optimization algorithms helps us to minimize (or maximize) a Loss function (another name for Error function) E(x) which is simply a mathematical function dependent on the Model's internal parameters which are used in computing the target values(Y) from the set of predictors(X) used in the model. For example -- we call the Weights(W) and the Bias(b) values of the Neural Network as its internal parameters which are used in computing the Output values and play a major role in the training process of the Neural Network Model .


An Alternative to EM for Gaussian Mixture Models: Batch and Stochastic Riemannian Optimization

arXiv.org Machine Learning

We consider maximum likelihood estimation for Gaussian Mixture Models (Gmms). This task is almost invariably solved (in theory and practice) via the Expectation Maximization (EM) algorithm. EM owes its success to various factors, of which is its ability to fulfill positive definiteness constraints in closed form is of key importance. We propose an alternative to EM by appealing to the rich Riemannian geometry of positive definite matrices, using which we cast Gmm parameter estimation as a Riemannian optimization problem. Surprisingly, such an out-of-the-box Riemannian formulation completely fails and proves much inferior to EM. This motivates us to take a closer look at the problem geometry, and derive a better formulation that is much more amenable to Riemannian optimization. We then develop (Riemannian) batch and stochastic gradient algorithms that outperform EM, often substantially. We provide a non-asymptotic convergence analysis for our stochastic method, which is also the first (to our knowledge) such global analysis for Riemannian stochastic gradient. Numerous empirical results are included to demonstrate the effectiveness of our methods.


Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization

arXiv.org Machine Learning

Asynchronous parallel implementations of stochastic gradient (SG) have been broadly used in solving deep neural network and received many successes in practice recently. However, existing theories cannot explain their convergence and speedup properties, mainly due to the nonconvexity of most deep learning formulations and the asynchronous parallel mechanism. To fill the gaps in theory and provide theoretical supports, this paper studies two asynchronous parallel implementations of SG: one is on the computer network and the other is on the shared memory system. We establish an ergodic convergence rate $O(1/\sqrt{K})$ for both algorithms and prove that the linear speedup is achievable if the number of workers is bounded by $\sqrt{K}$ ($K$ is the total number of iterations). Our results generalize and improve existing analysis for convex minimization.