AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Large-Margin Convex Polytope Machine

Kantchelian, Alex, Tschantz, Michael C., Huang, Ling, Bartlett, Peter L., Joseph, Anthony D., Tygar, J. D.

Neural Information Processing SystemsFeb-14-2020, 12:11:40 GMT

We present the Convex Polytope Machine (CPM), a novel non-linear learning algorithm for large-scale binary classification tasks. The CPM finds a large margin convex polytope separator which encloses one class. We develop a stochastic gradient descent based algorithm that is amenable to massive datasets, and augment it with a heuristic procedure to avoid sub-optimal local minima. Our experimental evaluations of the CPM on large-scale datasets from distinct domains (MNIST handwritten digit recognition, text topic, and web security) demonstrate that the CPM trains models faster, sometimes several orders of magnitude, than state-of-the-art similar approaches and kernel-SVM methods while achieving comparable or better classification performance. Our empirical results suggest that, unlike prior similar approaches, we do not need to control the number of sub-classifiers (sides of the polytope) to avoid overfitting.

algorithm, convex polytope machine, large-margin convex polytope machine, (1 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.66)

Add feedback

On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

Chizat, Lénaïc, Bach, Francis

Neural Information Processing SystemsFeb-14-2020, 11:58:20 GMT

Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers.

global convergence, optimal transport, over-parameterized model, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.66)

Add feedback

Bayesian Sampling Using Stochastic Gradient Thermostats

Ding, Nan, Fang, Youhan, Babbush, Ryan, Chen, Changyou, Skeel, Robert D., Neven, Hartmut

Neural Information Processing SystemsFeb-14-2020, 11:57:06 GMT

Dynamics-based sampling methods, such as Hybrid Monte Carlo (HMC) and Langevin dynamics (LD), are commonly used to sample target distributions. Recently, such approaches have been combined with stochastic gradient techniques to increase sampling efficiency when dealing with large datasets. An outstanding problem with this approach is that the stochastic gradient introduces an unknown amount of noise which can prevent proper sampling after discretization. To remedy this problem, we show that one can leverage a small number of additional variables in order to stabilize momentum fluctuations induced by the unknown noise. Our method is inspired by the idea of a thermostat in statistical physics and is justified by a general theory.

bayesian sampling, noise, stochastic gradient thermostat

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.95)

Add feedback

Stochastic Cubic Regularization for Fast Nonconvex Optimization

Tripuraneni, Nilesh, Stern, Mitchell, Jin, Chi, Regier, Jeffrey, Jordan, Michael I.

Neural Information Processing SystemsFeb-14-2020, 11:43:57 GMT

This paper proposes a stochastic variant of a classic algorithm---the cubic-regularized Newton method [Nesterov and Polyak]. The proposed algorithm efficiently escapes saddle points and finds approximate local minima for general smooth, nonconvex functions in only $\mathcal{\tilde{O}}(\epsilon {-3.5})$ stochastic gradient and stochastic Hessian-vector product evaluations. The latter can be computed as efficiently as stochastic gradients. This improves upon the $\mathcal{\tilde{O}}(\epsilon {-4})$ rate of stochastic gradient descent. Our rate matches the best-known result for finding local minima without requiring any delicate acceleration or variance-reduction techniques.

fast nonconvex optimization, local minima, stochastic cubic regularization, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent

London, Ben

Neural Information Processing SystemsFeb-14-2020, 11:43:21 GMT

We study the generalization error of randomized learning algorithms -- focusing on stochastic gradient descent (SGD) -- using a novel combination of PAC-Bayes and algorithmic stability. Importantly, our generalization bounds hold for all posterior distributions on an algorithm's random hyperparameters, including distributions that depend on the training data. We analyze this algorithm in the context of our generalization bounds and evaluate it on a benchmark dataset. Our experiments demonstrate that adaptive sampling can reduce empirical risk faster than uniform sampling while also improving out-of-sample accuracy. Papers published at the Neural Information Processing Systems Conference.

pac-bayesian analysis, randomized learning, stochastic gradient descent, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Scalable Kernel Methods via Doubly Stochastic Gradients

Dai, Bo, Xie, Bo, He, Niao, Liang, Yingyu, Raj, Anant, Balcan, Maria-Florina F., Song, Le

Neural Information Processing SystemsFeb-14-2020, 11:41:46 GMT

The general perception is that kernel methods are not scalable, so neural nets become the choice for large-scale nonlinear learning problems. Have we tried hard enough for kernel methods? In this paper, we propose an approach that scales up kernel methods using a novel concept called doubly stochastic functional gradients''. Based on the fact that many kernel methods can be expressed as convex optimization problems, our approach solves the optimization problems by making two unbiased stochastic approximations to the functional gradient---one using random training points and another using random features associated with the kernel---and performing descent steps with this noisy functional gradient. Our algorithm is simple, need no commit to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting.

doubly stochastic gradient, kernel method, scalable kernel method, (4 more...)

Neural Information Processing Systems

Genre: Research Report (0.41)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Kernel Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Add feedback

Scale Up Nonlinear Component Analysis with Doubly Stochastic Gradients

Xie, Bo, Liang, Yingyu, Song, Le

Neural Information Processing SystemsFeb-14-2020, 11:27:01 GMT

Nonlinear component analysis such as kernel Principle Component Analysis (KPCA) and kernel Canonical Correlation Analysis (KCCA) are widely used in machine learning, statistics and data analysis, but they can not scale up to big datasets. Recent attempts have employed random feature approximations to convert the problem to the primal form for linear computational complexity. However, to obtain high quality solutions, the number of random features should be the same order of magnitude as the number of data points, making such approach not directly applicable to the regime with millions of data points.We propose a simple, computationally efficient, and memory friendly algorithm based on the doubly stochastic gradients'' to scale up a range of kernel nonlinear component analysis, such as kernel PCA, CCA and SVD. Despite the \emph{non-convex} nature of these problems, our method enjoys theoretical guarantees that it converges at the rate $\Otil(1/t)$ to the global optimum, even for the top $k$ eigen subspace. Unlike many alternatives, our algorithm does not require explicit orthogonalization, which is infeasible on big datasets. We demonstrate the effectiveness and scalability of our algorithm on large scale synthetic and real world datasets.

component analysis, doubly stochastic gradient, nonlinear component analysis, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Add feedback

Variance Reduced Stochastic Gradient Descent with Neighbors

Hofmann, Thomas, Lucchi, Aurelien, Lacoste-Julien, Simon, McWilliams, Brian

Neural Information Processing SystemsFeb-14-2020, 11:26:00 GMT

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet it is also known to be slow relative to steepest descent. Recently, variance reduction techniques such as SVRG and SAGA have been proposed to overcome this weakness. With asymptotically vanishing variance, a constant step size can be maintained, resulting in geometric convergence rates. However, these methods are either based on occasional computations of full gradients at pivot points (SVRG), or on keeping per data point corrections in memory (SAGA). This has the disadvantage that one cannot employ these methods in a streaming setting and that speed-ups relative to SGD may need a certain number of epochs in order to materialize.

algorithm, stochastic gradient descent, variance reduced stochastic gradient descent, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Liu, Qiang, Wang, Dilin

Neural Information Processing SystemsFeb-14-2020, 11:13:18 GMT

We propose a general purpose variational inference algorithm that forms a natural counterpart of gradient descent for optimization. Our method iteratively transports a set of particles to match the target distribution, by applying a form of functional gradient descent that minimizes the KL divergence. Empirical studies are performed on various real world models and datasets, on which our method is competitive with existing state-of-the-art methods. The derivation of our method is based on a new theoretical result that connects the derivative of KL divergence under smooth transforms with Stein's identity and a recently proposed kernelized Stein discrepancy, which is of independent interest. Papers published at the Neural Information Processing Systems Conference.

general purpose bayesian inference algorithm, kl divergence, stein variational gradient descent

Neural Information Processing Systems

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.96)

Add feedback

Natasha 2: Faster Non-Convex Optimization Than SGD

Allen-Zhu, Zeyuan

Neural Information Processing SystemsFeb-14-2020, 11:13:15 GMT

We design a stochastic algorithm to find $\varepsilon$-approximate local minima of any smooth nonconvex function in rate $O(\varepsilon {-3.25})$, with only oracle access to stochastic gradients. The best result before this work was $O(\varepsilon {-4})$ by stochastic gradient descent (SGD). Papers published at the Neural Information Processing Systems Conference.

faster non-convex optimization, natasha 2, varepsilon, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback