AITopics | Gradient Descent


Gradient Descent


Beyond Convexity: Stochastic Quasi-Convex Optimization

Elad Hazan, Kfir Levy, Shai Shalev-Shwartz

Neural Information Processing Systems, Oct-2-2025, 11:08:09 GMT

Stochastic convex optimization is a basic and well-studied primitive in machine learning. It is well known that convex and Lipschitz functions can be minimized efficiently using Stochastic Gradient Descent (SGD). The Normalized Gradient Descent (NGD) algorithm is an adaptation of Gradient Descent that updates according to the direction of the gradients, rather than the gradients themselves. In this paper we analyze a stochastic version of NGD and prove its convergence to a global minimum for a wider class of functions: we require the functions to be quasi-convex and locally-Lipschitz. Quasi-convexity broadens the concept of unimodality to multiple dimensions and allows for certain types of saddle points, which are a known hurdle for first-order optimization methods such as gradient descent. Locally-Lipschitz functions are only required to be Lipschitz in a small region around the optimum. This assumption circumvents gradient explosion, which is another known hurdle for gradient descent variants. Interestingly, unlike the vanilla SGD algorithm, the stochastic normalized gradient descent algorithm provably requires a minimal minibatch size.
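To make the normalized update in this abstract concrete, here is a minimal sketch of minibatch stochastic normalized gradient descent. It is an illustration only, not the authors' implementation: the function names (`sngd`, `grad_sample`), the toy quadratic loss, the fixed step size, and the batch size are all assumptions chosen for the example.

```python
import math
import random

def sngd(grad_fn, x0, data, lr=0.05, batch_size=16, steps=500, seed=0):
    """Minibatch stochastic normalized gradient descent (illustrative sketch).

    grad_fn(x, sample) returns the per-sample gradient as a list.
    Every step moves a fixed distance lr along the negative direction of
    the averaged minibatch gradient, ignoring the gradient's magnitude.
    """
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # Average the per-sample gradients over the minibatch.
        g = [0.0] * len(x)
        for s in batch:
            gi = grad_fn(x, s)
            g = [a + b / batch_size for a, b in zip(g, gi)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:
            break  # zero minibatch gradient: no direction to follow
        x = [xi - lr * gi / norm for xi, gi in zip(x, g)]
    return x

# Toy problem (assumption): per-sample loss 0.5 * ||x - s||^2, whose
# minimizer over the dataset is the sample mean.
def grad_sample(x, s):
    return [xi - si for xi, si in zip(x, s)]

rng = random.Random(1)
data = [(1.0 + rng.gauss(0, 0.05), -2.0 + rng.gauss(0, 0.05)) for _ in range(200)]
x_final = sngd(grad_sample, [0.0, 0.0], data)
print(x_final)
```

Because the step length is always `lr`, the iterate in this sketch does not settle exactly at the minimizer but hovers within roughly one step length of it; a decaying step size would be the usual remedy.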

artificial intelligence, gradient, machine learning, (16 more...)

Neural Information Processing Systems

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)


Stochastic Variance Reduced Primal Dual Algorithms for Empirical Composition Optimization

Adithya M Devraj, Jianshu Chen

Neural Information Processing Systems, Oct-2-2025, 09:57:45 GMT

The work was done during an internship at Tencent AI Lab, Bellevue, WA.

algorithm, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.31)


Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates

Neural Information Processing Systems, Oct-2-2025, 09:45:37 GMT

We propose to use line-search techniques to automatically set the step-size when training models that can interpolate the data.
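As a rough illustration of setting the step size by line search, the sketch below runs SGD on a noiseless least-squares problem (so the model can interpolate the data) and picks each step with Armijo backtracking evaluated on the sampled example. The choice of Armijo backtracking, the names `sgd_armijo` and `loss_and_grad`, and the constants `eta_max`, `c`, and `beta` are assumptions made for this example, not details taken from the listing.

```python
import random

def loss_and_grad(w, a, y):
    """Squared loss 0.5 * (a.w - y)^2 for one example, and its gradient."""
    r = sum(ai * wi for ai, wi in zip(a, w)) - y
    return 0.5 * r * r, [r * ai for ai in a]

def sgd_armijo(data, w0, eta_max=1.0, c=0.5, beta=0.5, steps=2000, seed=0):
    """SGD whose step size is chosen per iteration by backtracking."""
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(steps):
        a, y = rng.choice(data)
        f, g = loss_and_grad(w, a, y)
        g2 = sum(gi * gi for gi in g)
        if g2 == 0.0:
            continue  # this example is already fit exactly
        eta = eta_max
        # Shrink eta until the sufficient-decrease (Armijo) condition
        # holds on the sampled example; the floor guards termination.
        while eta > 1e-12:
            w_new = [wi - eta * gi for wi, gi in zip(w, g)]
            f_new, _ = loss_and_grad(w_new, a, y)
            if f_new <= f - c * eta * g2:
                break
            eta *= beta
        w = w_new
    return w

# Toy interpolation setting (assumption): labels are generated exactly
# by w_star, so every per-example loss is zero at the solution.
rng = random.Random(1)
w_star = [2.0, -1.0, 0.5]
data = []
for _ in range(50):
    a = [rng.gauss(0, 1) for _ in range(3)]
    data.append((a, sum(ai * wi for ai, wi in zip(a, w_star))))

w_hat = sgd_armijo(data, [0.0, 0.0, 0.0])
print(w_hat)
```

No step size is tuned by hand here: `eta_max` is only an upper bound, and the backtracking loop adapts the step to each sampled example.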

artificial intelligence, interpolation, machine learning, (15 more...)

Neural Information Processing Systems

Country: North America > Canada (0.29)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)


Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods

Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, Meisam Razaviyayn

Neural Information Processing Systems, Oct-2-2025, 09:38:00 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, artificial intelligence, machine learning, (14 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.15)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.31)


Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets

Pascal Vincent, Alexandre de Brébisson, Xavier Bouthillier

Neural Information Processing Systems, Oct-2-2025, 09:31:37 GMT

An important class of problems involves training deep neural networks with sparse prediction targets of very high dimension D. These occur naturally in e.g.

artificial intelligence, gradient update, machine learning, (17 more...)

Neural Information Processing Systems

Country:

North America > Canada > Quebec > Montreal (0.04)
Asia > Singapore (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.53)


Content 14

Neural Information Processing Systems, Oct-2-2025, 09:13:20 GMT

First we start with the reparametrized Projected Gradient Descent algorithm. The update rule for g follows directly. Suppose that the loss does not converge to zero. Now, in general, to avoid converging to this set, we must make some additional assumptions on the initialization. However, it is also more general.

artificial intelligence, machine learning, matrix, (19 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)


Scale Up Nonlinear Component Analysis with Doubly Stochastic Gradients

Bo Xie, Yingyu Liang, Le Song

Neural Information Processing Systems, Oct-2-2025, 09:11:36 GMT

Nonlinear component analysis methods such as kernel Principal Component Analysis (KPCA) and kernel Canonical Correlation Analysis (KCCA) are widely used in machine learning, statistics and data analysis, but they cannot scale up to big datasets. Recent attempts have employed random feature approximations to convert the problem to the primal form for linear computational complexity. However, to obtain high quality solutions, the number of random features should be the same order of magnitude as the number of data points, making such an approach not directly applicable to the regime with millions of data points. We propose a simple, computationally efficient, and memory friendly algorithm based on the "doubly stochastic gradients" to scale up a range of kernel nonlinear component analysis, such as kernel PCA, CCA and SVD. Despite the non-convex nature of these problems, our method enjoys theoretical guarantees that it converges at the rate O(1/t) to the global optimum, even for the top k eigen subspace. Unlike many alternatives, our algorithm does not require explicit orthogonalization, which is infeasible on big datasets. We demonstrate the effectiveness and scalability of our algorithm on large scale synthetic and real world datasets.
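The random feature approximation that this abstract refers to as prior work can be sketched as follows. This is not the doubly stochastic algorithm proposed in the paper; it only shows, under assumed settings, how random Fourier features turn a Gaussian kernel evaluation into an inner product of explicit feature vectors. The names `make_rff` and `gaussian_kernel`, the bandwidth `sigma`, and the feature count are all hypothetical choices for the example.

```python
import math
import random

def make_rff(dim, num_features, sigma=1.0, seed=0):
    """Build a random Fourier feature map for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 1/sigma^2), phases from U[0, 2*pi].
    W = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
         for _ in range(num_features)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + bj)
                for w, bj in zip(W, b)]

    return features

def gaussian_kernel(x, y, sigma=1.0):
    d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

features = make_rff(dim=3, num_features=4000)
x = [0.3, -0.2, 0.5]
y = [0.1, 0.4, -0.3]

# The inner product of the feature vectors approximates the kernel value.
approx = sum(p * q for p, q in zip(features(x), features(y)))
exact = gaussian_kernel(x, y)
print(approx, exact)
```

The approximation error shrinks only as the number of features grows, which is the scaling limitation the abstract points to.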

algorithm, dataset, random feature, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)


Efficient Smooth Non-Convex Stochastic Compositional Optimization via Stochastic Recursive Gradient Descent

Wenqing Hu, Chris Junchi Li, Xiangru Lian, Ji Liu, Huizhuo Yuan

Neural Information Processing Systems, Oct-2-2025, 09:03:42 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.53)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)


Supplementary File for "Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes"

Neural Information Processing Systems, Oct-2-2025, 08:53:06 GMT

The supplementary file is organized as follows: Section 1 restates the assumptions and main theorems on the convergence of parameter iterates and the full gradient; Section 2 is devoted to the proofs of the two main theorems, while Section 3 includes the proofs of supporting lemmas; Section 4 includes additional figures from the numerical study. Under Assumptions 1.1 to 1.3, when m > C for some constant C > 0, we have the following results under two corresponding conditions on s [...] First we present the following lemma, showing that the loss function has a property similar to strong convexity. For the first case discussed in Lemma 2.1, define [...] Therefore, combining Lemma 2.1, Lemma 2.2 and (7) leads to the following conclusion. Proof of Theorem 2. We start from bounding [...] Under this case, we can still apply (15) in Lemma 2.3. The following proof of this claim is very similar to the proof of Lemma 5.2 in [2].

artificial intelligence, machine learning, probability, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)
