Goto

Collaborating Authors

 Gradient Descent


Reviews: L4: Practical loss-based stepsize adaptation for deep learning

Neural Information Processing Systems

The paper proposes a scheme for adaptive choice of learning rate for stochastic gradients descent and its variants. The key idea is very simple and easy to implement: given the loss value L at the global minimum, L_min, the idea is to choose learning rate eta, such that the update along the gradient reaches L_min from the current point i.e. solving L(theta-eta*v) L_min in eta, where v is for example dL/dtheta in gradient descent. Finally, to make the adaptive learning rate pessimistic to the possible linearization error, the authors introduce a coefficient alpha, so the effective learning rate used by the optimizer is eta*alpha. The authors empirically show (on badly conditioned regression, MNIST, CIFAR-10, and neural computer) that using such adaptive scheme helps in two ways: 1. the optimization performance is less sensitive to the choice of the coefficient alpha vs the learning rate (in non-adaptive setting), and 2. the optimizer can reduce the loss faster or at worst in equal speed with commonly used optimizers. At the same time, the paper has some shortcomings as admitted by the authors: 1.


Reviews: VAE Learning via Stein Variational Gradient Descent

Neural Information Processing Systems

However I feel that the presentation could be improved and more details about the (complicated) implementation should have been included. For instance, first p(x z_n ; \theta) is called the "encoder", then q(z x ; \phi) is called the "encoder" - which is it, and why is it given this name?


Reviews: Stochastic Chebyshev Gradient Descent for Spectral Optimization

Neural Information Processing Systems

Spectral optimization is defined as finding \theta that minimizes F(A(\theta)) g(\theta) where A(\theta) is a symmetric matrix and F typically the trace of an analytic function i.e. F(A) tr(p(A)) where p is a polynomial. They propose an unbiased estimator of F by randomly truncating the Chebyshev approximation to F and doing importance sampling. Moreover, they calculate the optimal distribution for this importance sampling. They demonstrate how this method would be used for SGD and stochastic Variance Reduced Gradient.


Reviews: Bayesian Distributed Stochastic Gradient Descent

Neural Information Processing Systems

Summary: This paper presents a new algorithm called Bayesian Distributed SGD to mitigate the straggler problem when training deep learning on parallel clusters. Unlike Synchronous Distributed SGD approach where a fixed cut-off (number of workers) is predefined, BDSGD uses amortized inference to predict workers' run-times and derive a straggler cut-off accordingly. BDSGD models the joint run-time behaviour of workers which are likely to be correlated due to the underlying cluster architecture. The approach is incorporated as part of the parameter server framework, deciding which sub-gradients to drop in each iteration. Strength: The proposed idea of adaptive cut-off and predicting joint worker runtime through amortized inference with variational auto encoder loss is novel and very interesting.


Reviews: Gradient Descent Meets Shift-and-Invert Preconditioning for Eigenvector Computation

Neural Information Processing Systems

The main idea is incorporating Nesterov's accelerated gradient descent (AGD) in eigenvalue problem. The approach relies on shift-and-invert preconditioning method that reduces the non-convex objective of Rayleigh quotient to a sequence of convex programs. Shift-and-invert preconditioning improves the convergence dependency of the gradient method to the eigengap of the given matrix. The focus of this paper is using AGD method to approximately solve the convex programs and reaching an accelerated convergence rate for the convex part. Exploiting the accelerated convergence of AGD, they reach an accelerated convergence for the first-order optimization of the eigenvalue problem.


Reviews: PCA of high dimensional random walks with comparison to neural network training

Neural Information Processing Systems

Motivated by the problem of visualizing the loss landscape of a deep neural networks during training, and the heuristic consisting of performing PCA on the set of parameters output by stochastic gradient descent, this paper considers a simplified model where the data comes from a simple random walk in Euclidean space instead of a NN. The authors attempt to justify this heuristic by asking what the projection of this walk on its first few principal components should look like. An asymptotic analysis is performed and the conclusion that most of the variance of the walk is captured by its first few components is reached. This reasoning is then extended to the case of a discrete Ornstein-Uhlenbeck process. Then the authors show that their findings are reasonably accurate compared to data coming from a NN on real-world datasets. The idea of performing PCA on the output of a NN for purposes of visualization is an interesting one, and this paper makes a first step towards understanding this proposal through the analysis of a very simple model.


Reviews: Stein Variational Gradient Descent as Gradient Flow

Neural Information Processing Systems

The paper provides asymptotic convergence results, both in the large-particle and large-time limits. The paper also investigates the continuous-time limit of SVGD, which results in a PDE that has the flavor of a deterministic Fokker-Planck equation. Finally, the paper offers a geometric perspective, interpreting the continuous-time process as a gradient flow and introducing a novel optimal transport metric along the way. Overall, this is a very nice paper with some insightful results. However, there are a few important technical issues that prevent me from recommending publication.


Reviews: Learning with SGD and Random Features

Neural Information Processing Systems

I have updated my score to an 8 accordingly. I think the planned updates to the empirical section will add a lot of value. Summary: This paper analyzes the generalization performance of models trained using mini-batch stochastic gradient methods with random features (eg, for kernel approximation), for regression tasks using the least squares loss. Their main theorem (Theorem 1) bounds the gap in generalization performance between the lowest risk model in the RKHS with the SGD trained model after t mini-batch updates; the bound is in terms of the learning rate, the mini-batch size, the number of random features M, the training set size n, and t. They show that under certain choices of these parameters, M O(sqrt(n)) features are sufficient to guarantee that the gap is 1/sqrt(n).


Reviews: New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity

Neural Information Processing Systems

This paper lies in the well-developed field of variance-reduced stochastic gradient algorithms. It proposes a theoretical study of the well-known HSGD for sampling without replacement scheme, which is known to perform better than sampling with replacement in practice. The considered setting makes the analysis of the algorithm more difficult, since in this case the mini-batch gradients are not unbiased anymore. Rates for strongly-convex, non-strongly convex and non-convex objectives are provided, with a special care for linearly structured problems (such as generalized linear models). Numerical experiments are convincing enough, although the datasets used in experiments are not very large (largest one is covtype) (while authors argue that this methods is interesting in the large n medium precision setting).


Reviews: Neural Proximal Gradient Descent for Compressive Imaging

Neural Information Processing Systems

While my concerns were given significant attention in the rebuttal, I feel they were not fully addressed. In particular, regarding comparison with deep ADMM-net and LDAMP, the authors argue that these methods need more training data/training time. However, training time is normally not a big issue (you only train your model once, does it matter if it takes 2 hours or 10?). The *size* of the training data is however important, but no experiments are provided to show superior performance of the proposed method with respect to the the size of training data. This is surprising given that in l. 62. the authors say they use "much less training data" (addressing the challenge of "scarcity of training data" mentioned in l.4 in abstract), without referring back to this claimed contribution anywhere in the paper!