AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Review for NeurIPS paper: Debiasing Averaged Stochastic Gradient Descent to handle missing values

Neural Information Processing SystemsJan-26-2025, 19:13:59 GMT

A key quantity in the analysis is p_j, the probability of missing j-th feature. Is this quantity given in advanced? Under MCAR, it can be easily estimated. But some clarifications are needed. Essentially, this method is just SGD IPW (though it is coordinate-wise IPW not the regular IPW).

debiasing averaged stochastic gradient descent, ipw, neurips paper, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)

Add feedback

Review for NeurIPS paper: Debiasing Averaged Stochastic Gradient Descent to handle missing values

Neural Information Processing SystemsJan-26-2025, 19:04:15 GMT

The paper addresses the issue of missing values in SGD. There was agreement that the obtained results are novel but there was no strong advocate for the paper.

debiasing averaged stochastic gradient descent, neurips paper

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

Add feedback

Reviews: Ultrametric Fitting by Gradient Descent

Neural Information Processing SystemsJan-26-2025, 16:12:18 GMT

Originality: For the aforementioned contributions, I believe this work provides a creative, unique approach to this problem. Quality: I believe this paper to be technically sound, a complete work that presents interesting approaches for hierarchical clustering. Clarity: The paper is written well and clearly explains the approach. But there were a some details that I thought could have been made clearer in both the presentation and in the experiments. Unless I've missed something, I think that it would be good to more clearly state the process (and its complexity) of going from the ultrametric fit to data to a dendrogram.

experiment, gradient descent, ultrametric fitting, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.53)

Add feedback

Reviews: Ultrametric Fitting by Gradient Descent

Neural Information Processing SystemsJan-26-2025, 15:56:58 GMT

The authors develop a nice method for fitting a hierarchical clustering by solving an optimization problem that leads to an ultrametric (and thus a hierarchical clustering). The reviewers are in agreement that this work should be accepted.

gradient descent, ultrametric fitting

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Add feedback

Reviews: Momentum-Based Variance Reduction in Non-Convex SGD

Neural Information Processing SystemsJan-26-2025, 15:54:52 GMT

I agree with R3 that you did a poor job on relating your work to existing methods, in particular SARAH. Please also make sure that you carefully address the question of optimality. I also realized that your method in fact has nothing to do with momentum. Consider for instance deterministic objective, f(x, \xi) f(x). If one has a tight estimate, i.e. d_{t-1} abla f(x_{t-1}), then from your update rules it follows that d_t abla f(x_t), i.e. the method become gradient descent with no momentum! Your title, thus, is very confusing and I highly encourage you to change it.

experiment, momentum-based variance reduction, non-convex sgd, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.39)

Add feedback

Reviews: A Communication Efficient Stochastic Multi-Block Alternating Direction Method of Multipliers

Neural Information Processing SystemsJan-26-2025, 12:54:24 GMT

This paper considers a communication efficient distributed optimization algorithm based on ADMM for stochastic optimization. The main idea is to perform multiple steps (can be timevarying) of stochastic gradient updates before the agents communicate, and therefore improving the communication efficiency. The proposed algorithm is shown to converge (in objective value & constraint violation) under a general non-smooth, non-strongly convex settings with O(1/eps) communication rounds and O(1/eps 2) unbiased gradient oracle calls. Other setting such as smooth strongly convex and non-smooth strongly convex are also analyzed and presented. Using multiple steps is however not novel.

algorithm, optimization, stochastic multi-block alternating direction method, (10 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.58)

Add feedback

Reviews: SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points

Neural Information Processing SystemsJan-26-2025, 10:25:51 GMT

After response----I keep my evaluation on the technical innovation and suboptimality of this paper. The basic Spider and SpiderBoost algorithms are both for first-order stationary point, they are almost the same, and both give n {1/2} rate. The simple way to modify both algorithms to escape saddle point is to add Negative Curvature Search (NCS) subroutine (which can be done in a very modular way, and is already shown in the Spider paper). I'd say it's almost trivial to also show SpiderBoost NCS to find second-order stationary point with n {1/2} rate. Comparing this paper with SpiderBoost NCS, there's no improvement from n {2/3} to n {1/2} (since Spiderboost is already n {1/2}), no simplification of Spider (as Spiderboost already did so). The only difference is replacing NCS by perturbations, which again requires some work, but most techniques are already there.

escape saddle point, escaping saddle point, simple stochastic recursive gradient descent, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Add feedback

Reviews: SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points

Neural Information Processing SystemsJan-26-2025, 10:25:41 GMT

After extensive back and forth discussion by the reviewers, ultimately we felt that despite trivial solutions that are possible by mixing spider with negative curvature search, the approach of the present paper has its usefulness (as noted in the reviews too), and that the paper can be accepted. The authors are, however, strongly encouraged to look into the reviews carefully, and implement the points mentioned in the rebuttal -- because it would be a pity if after all the back-n-forth that this paper's review cycle has witnessed, if the authors did not update the paper to clarify its contribution, its value, and its contrasts with less implementable methods.

escaping saddle point, simple stochastic recursive gradient descent, ssrgd

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Add feedback

Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay

Neural Information Processing SystemsJan-26-2025, 10:07:01 GMT

We prove the Fast Equilibrium Conjecture proposed by Li et al., (2020), i.e., stochastic gradient descent (SGD) on a scale-invariant loss (e.g., using networks with various normalization schemes) with learning rate \eta and weight decay factor \lambda mixes in function space in \mathcal{\tilde{O}}(\frac{1}{\lambda\eta}) steps, under two standard assumptions: (1) the noise covariance matrix is non-degenerate and (2) the minimizers of the loss form a connected, compact and analytic manifold. The analysis uses the framework of Li et al., (2021) and shows that for every T 0, the iterates of SGD with learning rate \eta and weight decay factor \lambda on the scale-invariant loss converge in distribution in \Theta\left(\eta {-1}\lambda {-1}(T \ln(\lambda/\eta))\right) iterations as \eta\lambda\to 0 while satisfying \eta \le O(\lambda)\le O(1) . Moreover, the evolution of the limiting distribution can be described by a stochastic differential equation that mixes to the same equilibrium distribution for every initialization around the manifold of minimizers as T\to\infty .

fast mixing, normalization and weight decay, stochastic gradient descent, (4 more...)

Neural Information Processing Systems

Genre: Play > Prospect (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Reviews: First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Neural Information Processing SystemsJan-26-2025, 08:24:01 GMT

For Reviewer #1's concern about making theory, I tend to be open-minded since I can not find solid evidence that the paper is making theory only. For Reviewer #4's comment about the over-claim of the result the paper proved, my take is follows. First, for many problems, the true local minima enjoys the flat basin. A famous example I have is the following paper: McGoff, Kevin A., et al. "The Local Edge Machine: inference of dynamic models of gene regulation." Second, the authors have explained the motivation of using the Levy process to model the noise.

exit time analysis, heavy-tailed gradient noise, stochastic gradient descent, (6 more...)

Neural Information Processing Systems

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.38)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

Add feedback