Gradient Descent
Reviews: Ultrametric Fitting by Gradient Descent
Originality: For the aforementioned contributions, I believe this work provides a creative, unique approach to this problem. Quality: I believe this paper to be technically sound, a complete work that presents interesting approaches for hierarchical clustering. Clarity: The paper is written well and clearly explains the approach. But there were a some details that I thought could have been made clearer in both the presentation and in the experiments. Unless I've missed something, I think that it would be good to more clearly state the process (and its complexity) of going from the ultrametric fit to data to a dendrogram.
Reviews: Momentum-Based Variance Reduction in Non-Convex SGD
I agree with R3 that you did a poor job on relating your work to existing methods, in particular SARAH. Please also make sure that you carefully address the question of optimality. I also realized that your method in fact has nothing to do with momentum. Consider for instance deterministic objective, f(x, \xi) f(x). If one has a tight estimate, i.e. d_{t-1} abla f(x_{t-1}), then from your update rules it follows that d_t abla f(x_t), i.e. the method become gradient descent with no momentum! Your title, thus, is very confusing and I highly encourage you to change it.
Reviews: A Communication Efficient Stochastic Multi-Block Alternating Direction Method of Multipliers
This paper considers a communication efficient distributed optimization algorithm based on ADMM for stochastic optimization. The main idea is to perform multiple steps (can be timevarying) of stochastic gradient updates before the agents communicate, and therefore improving the communication efficiency. The proposed algorithm is shown to converge (in objective value & constraint violation) under a general non-smooth, non-strongly convex settings with O(1/eps) communication rounds and O(1/eps 2) unbiased gradient oracle calls. Other setting such as smooth strongly convex and non-smooth strongly convex are also analyzed and presented. Using multiple steps is however not novel.
Reviews: SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points
After response----I keep my evaluation on the technical innovation and suboptimality of this paper. The basic Spider and SpiderBoost algorithms are both for first-order stationary point, they are almost the same, and both give n {1/2} rate. The simple way to modify both algorithms to escape saddle point is to add Negative Curvature Search (NCS) subroutine (which can be done in a very modular way, and is already shown in the Spider paper). I'd say it's almost trivial to also show SpiderBoost NCS to find second-order stationary point with n {1/2} rate. Comparing this paper with SpiderBoost NCS, there's no improvement from n {2/3} to n {1/2} (since Spiderboost is already n {1/2}), no simplification of Spider (as Spiderboost already did so). The only difference is replacing NCS by perturbations, which again requires some work, but most techniques are already there.
Reviews: SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points
After extensive back and forth discussion by the reviewers, ultimately we felt that despite trivial solutions that are possible by mixing spider with negative curvature search, the approach of the present paper has its usefulness (as noted in the reviews too), and that the paper can be accepted. The authors are, however, strongly encouraged to look into the reviews carefully, and implement the points mentioned in the rebuttal -- because it would be a pity if after all the back-n-forth that this paper's review cycle has witnessed, if the authors did not update the paper to clarify its contribution, its value, and its contrasts with less implementable methods.
Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay
We prove the Fast Equilibrium Conjecture proposed by Li et al., (2020), i.e., stochastic gradient descent (SGD) on a scale-invariant loss (e.g., using networks with various normalization schemes) with learning rate \eta and weight decay factor \lambda mixes in function space in \mathcal{\tilde{O}}(\frac{1}{\lambda\eta}) steps, under two standard assumptions: (1) the noise covariance matrix is non-degenerate and (2) the minimizers of the loss form a connected, compact and analytic manifold. The analysis uses the framework of Li et al., (2021) and shows that for every T 0, the iterates of SGD with learning rate \eta and weight decay factor \lambda on the scale-invariant loss converge in distribution in \Theta\left(\eta {-1}\lambda {-1}(T \ln(\lambda/\eta))\right) iterations as \eta\lambda\to 0 while satisfying \eta \le O(\lambda)\le O(1) . Moreover, the evolution of the limiting distribution can be described by a stochastic differential equation that mixes to the same equilibrium distribution for every initialization around the manifold of minimizers as T\to\infty .
Reviews: First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise
For Reviewer #1's concern about making theory, I tend to be open-minded since I can not find solid evidence that the paper is making theory only. For Reviewer #4's comment about the over-claim of the result the paper proved, my take is follows. First, for many problems, the true local minima enjoys the flat basin. A famous example I have is the following paper: McGoff, Kevin A., et al. "The Local Edge Machine: inference of dynamic models of gene regulation." Second, the authors have explained the motivation of using the Levy process to model the noise.