AITopics | Gradient Descent

In many learning problems, ranging from clustering to ranking through metric learning, empirical estimates of the risk functional consist of an average over tu-ples ( e.g., pairs or triplets) of observations, rather than over individual observations. In this paper, we focus on how to best implement a stochastic approximation approach to solve such risk minimization problems. We argue that in the large-scale setting, gradient estimates should be obtained by sampling tuples of data points with replacement ( incomplete U -statistics) instead of sampling data points without replacement ( complete U -statistics based on subsamples). We develop a theoretical framework accounting for the substantial impact of this strategy on the generalization ability of the prediction model returned by the Stochastic Gradient Descent (SGD) algorithm. It reveals that the method we promote achieves a much better trade-off between statistical accuracy and computational cost. Beyond the rate bound analysis, experiments on AUC maximization and metric learning provide strong empirical evidence of the superiority of the proposed approach.

artificial intelligence, machine learning, variance, (15 more...)

Neural Information Processing Systems

Country: Europe > France > Île-de-France > Paris > Paris (0.04)

Industry: Education (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Add feedback

Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization

Xiangru Lian, Yijun Huang, Yuncheng Li, Ji Liu

Neural Information Processing SystemsOct-2-2025, 05:31:14 GMT

Asynchronous parallel implementations of stochastic gradient (SG) have been broadly used in solving deep neural network and received many successes in practice recently. However, existing theories cannot explain their convergence and speedup properties, mainly due to the nonconvexity of most deep learning formulations and the asynchronous parallel mechanism. To fill the gaps in theory and provide theoretical supports, this paper studies two asynchronous parallel implementations of SG: one is over a computer network and the other is on a shared memory system. We establish an ergodic convergence rate O (1 / K) for both algorithms and prove that the linear speedup is achievable if the number of workers is bounded by K ( K is the total number of iterations). Our results generalize and improve existing analysis for convex minimization.

algorithm, artificial intelligence, machine learning, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout

Neural Information Processing SystemsOct-2-2025, 05:22:20 GMT

The vast majority of deep models use multiple gradient signals, typically corresponding to a sum of multiple loss terms, to update a shared set of trainable weights.

artificial intelligence, deep learning, machine learning, (14 more...)

Neural Information Processing Systems

Country: North America (0.15)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Equilibrated adaptive learning rates for non-convex optimization

Yann Dauphin, Harm de Vries, Yoshua Bengio

Neural Information Processing SystemsOct-2-2025, 05:10:53 GMT

Parameter-specific adaptive learning rate methods are computationally efficient ways to reduce the ill-conditioning problems encountered when training large deep networks. Following recent work that strongly suggests that most of the critical points encountered when training such networks are saddle points, we find how considering the presence of negative eigenvalues of the Hessian could help us design better suited adaptive learning rate schemes. We show that the popular Jacobi preconditioner has undesirable behavior in the presence of both positive and negative curvature, and present theoretical and empirical evidence that the so-called equilibration preconditioner is comparatively better suited to non-convex problems. We introduce a novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner. Our experiments show that ESGD performs as well or better than RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.

condition number, matrix, preconditioner, (15 more...)

Neural Information Processing Systems

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Global Convergence of Gradient Descent for Deep Linear Residual Networks

Lei Wu, Qingcan Wang, Chao Ma

Neural Information Processing SystemsOct-2-2025, 04:41:40 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, initialization, machine learning, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.59)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Efficiently avoiding saddle points with zero order methods: No gradients required

Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, Georgios Piliouras

Neural Information Processing SystemsOct-2-2025, 03:41:34 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, saddle point, (15 more...)

Neural Information Processing Systems

Country:

Europe (1.00)
North America > Canada (0.69)
Asia > Middle East (0.68)
North America > United States > California (0.28)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.33)

Add feedback

A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements

Qinqing Zheng, John Lafferty

Neural Information Processing SystemsOct-2-2025, 03:22:39 GMT

We propose a simple, scalable, and fast gradient descent algorithm to optimize a nonconvex objective for the rank minimization problem and a closely related family of semidefinite programs.

algorithm, minimization, optimization, (12 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > Middle East > Jordan (0.04)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.87)

Add feedback

Learning with Incremental Iterative Regularization

Lorenzo Rosasco, Silvia Villa

Neural Information Processing SystemsOct-2-2025, 00:51:58 GMT

Within a statistical learning setting, we propose and study an iterative regularization algorithm for least squares defined by an incremental gradient method. In particular, we show that, if all other parameters are fixed a priori, the number of passes over the data (epochs) acts as a regularization parameter, and prove strong universal consistency, i.e. almost sure convergence of the risk, as well as sharp finite sample bounds for the iterates. Our results are a step towards understanding the effect of multiple epochs in stochastic gradient techniques in machine learning and rely on integrating statistical and optimization results.

algorithm, assumption, iteration, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Wisconsin (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre: Research Report > New Finding (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback