 Gradient Descent


Reviews: Uncertainty Sampling is Preconditioned Stochastic Gradient Descent on Zero-One Loss

Neural Information Processing Systems

This paper provides theoretical analysis and empirical examples for two phenomena in active learning. The first is that the 0-1 loss of a model trained on a subset of the dataset selected by uncertainty sampling can be smaller than that of a model trained on the whole dataset. The second is that uncertainty sampling can "converge" to different models and predictive results. The analysis shows that the reason for both is that the expected gradient of the "surrogate" loss at the most uncertain point is in the direction of the gradient of the current 0-1 loss. This result relies on a setup in which the most uncertain point is chosen from a minipool, a random subset sampled without replacement from the entire set.
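The minipool setup described in this review can be sketched as follows. This is only an illustration under assumptions that are not in the review: a linear classifier, a logistic surrogate loss, synthetic data, and hypothetical names (`most_uncertain_from_minipool`, `logistic_grad`). It is a plain SGD loop and does not reproduce the paper's preconditioning analysis.

```python
import numpy as np

def most_uncertain_from_minipool(X, w, pool_size, rng):
    """Draw a minipool without replacement and return the index of the
    point in it that lies closest to the current decision boundary."""
    pool = rng.choice(len(X), size=pool_size, replace=False)
    margins = np.abs(X[pool] @ w)
    return pool[np.argmin(margins)]

def logistic_grad(w, x, y):
    """Gradient of the surrogate log(1 + exp(-y * w.x)), with y in {-1, +1}."""
    return -y * x / (1.0 + np.exp(y * (w @ x)))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                            # synthetic inputs (assumption)
y = np.where(X @ np.array([1.0, -1.0]) > 0, 1.0, -1.0)   # synthetic labels (assumption)

w = np.array([0.1, 0.0])
for _ in range(200):
    i = most_uncertain_from_minipool(X, w, pool_size=20, rng=rng)
    w = w - 0.5 * logistic_grad(w, X[i], y[i])           # one surrogate-loss step

zero_one_loss = np.mean(np.sign(X @ w) != y)
print("0-1 loss on the full set:", zero_one_loss)
```

Only the queried points contribute gradient steps; the 0-1 loss is then measured on the whole set, which is the quantity the review's first phenomenon refers to.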


Reviews: Fast Approximate Natural Gradient Descent in a Kronecker Factored Eigenbasis

Neural Information Processing Systems

Summary: The paper describes a generic second-order stochastic optimisation scheme that exploits curvature information to improve the trade-off between convergence speed and computational effort. It proposes an extension to the approximate natural gradient method KFAC, in which the Fisher information matrix is restricted to have Kronecker structure. The authors propose to relax the Kronecker constraint and suggest using a general diagonal scaling matrix rather than a diagonal Kronecker scaling matrix. This diagonal scaling matrix is estimated from gradients along with the Kronecker eigenbasis. Quality: The idea in the paper is convincing and makes sense.
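One possible reading of the difference between the two scalings, for a single linear layer, is sketched below. Everything here is an assumption made for illustration: the random activations `a` and back-propagated gradients `g`, the damping value, and the function name `precondition`. It is not taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 256, 4, 3
a = rng.normal(size=(n, d_in))     # layer inputs (hypothetical data)
g = rng.normal(size=(n, d_out))    # gradients w.r.t. layer outputs (hypothetical)

# Kronecker factors of the Fisher approximation and their eigenbases.
A = a.T @ a / n
G = g.T @ g / n
sA, UA = np.linalg.eigh(A)
sG, UG = np.linalg.eigh(G)

# KFAC-style scaling: the diagonal is itself a Kronecker (outer) product.
scale_kfac = np.outer(sG, sA)

# General diagonal scaling: second moments of the per-example gradients
# g_i a_i^T, measured in the Kronecker eigenbasis.
proj = np.einsum("no,ni->noi", g @ UG, a @ UA)   # shape (n, d_out, d_in)
scale_eigen = (proj ** 2).mean(axis=0)

def precondition(grad, scale, damping=1e-3):
    """Rotate into the eigenbasis, rescale elementwise, rotate back."""
    return UG @ ((UG.T @ grad @ UA) / (scale + damping)) @ UA.T

grad = g.T @ a / n                               # mean gradient of the weights
step_kfac = precondition(grad, scale_kfac)
step_eigen = precondition(grad, scale_eigen)
```

Both variants share the same rotations; only the elementwise scale differs, which is the relaxation the review describes.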


Reviews: Inference in Deep Gaussian Processes using Stochastic Gradient Hamiltonian Monte Carlo

Neural Information Processing Systems

Update after rebuttal: I think the rebuttal is fair. It is very reassuring that pseudocode will be provided to the readers. I therefore keep my decision unchanged. Original review: In the paper "Inference in Deep Gaussian Processes using Stochastic Gradient Hamiltonian Monte Carlo" the author(s) consider the problem of inference for deep Gaussian processes (DGPs). Given the large number of layers and the width of each layer, direct inference is computationally infeasible, which has motivated numerous variational inference methods to approximate the posterior distribution, for example doubly stochastic variational inference (DSVI) of [Salimbeni and Deisenroth, 2017]. The authors argue that these unimodal approximations are typically poor given the multimodal and non-Gaussian nature of the posterior.


Reviews: Zeroth-order (Non)-Convex Stochastic Optimization via Conditional Gradient and Gradient Updates

Neural Information Processing Systems

Updated comments: I carefully read the authors' response and the paper again. I was mistaken when I read Algorithm 1 (Eq. (2.3)). The authors control the variance by simply averaging a batch of unbiased stochastic gradients. In this paper, the authors consider the problem of zeroth-order (non-)convex stochastic optimization via conditional gradient and gradient methods. However, all the techniques are already known but none are mentioned in the paper.
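The variance control mentioned here, averaging a batch of stochastic gradient estimates, can be illustrated with a generic zeroth-order estimator. The Gaussian-smoothing form below is a common choice and is assumed for the sketch; the review does not state which estimator Eq. (2.3) uses, and the names `zo_gradient` and `quadratic` are hypothetical.

```python
import numpy as np

def zo_gradient(f, x, nu, batch, rng):
    """Average `batch` function-value-only gradient estimates
    (f(x + nu*u) - f(x)) / nu * u with u ~ N(0, I).
    Each term is unbiased for the gradient of a smoothed version of f."""
    u = rng.normal(size=(batch, x.size))
    fx = f(x)
    diffs = np.array([f(x + nu * ui) - fx for ui in u])
    return (diffs[:, None] * u).mean(axis=0) / nu

def quadratic(x):
    """Test objective with known gradient x."""
    return 0.5 * float(x @ x)

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])
for batch in (1, 100, 10000):
    est = zo_gradient(quadratic, x, nu=1e-3, batch=batch, rng=rng)
    print(batch, np.linalg.norm(est - x))
```

Because the individual estimates are independent, the variance of the average falls in proportion to 1/batch.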


Reviews: ATOMO: Communication-efficient Learning via Atomic Sparsification

Neural Information Processing Systems

After rebuttal: I do not wish to change my evaluation. Regarding convergence, I think that this should be clarified in the paper, to at least ensure that the method does not produce divergent sequences under reasonable assumptions. As for the variance, the authors control the variance of a certain variable \hat{g} given g, but they should control the variance of \hat{g} without conditioning in order to invoke general convergence results. This is very minor but should be mentioned. The authors consider the problem of empirical risk minimization using a distributed stochastic gradient descent algorithm.
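To make the object \hat{g} concrete, here is a minimal sketch of an unbiased coordinate-wise sparsifier. The keep probabilities (proportional to magnitude, capped at 1) are a simple assumption for illustration and are not the variance-optimal probabilities the paper derives; `sparsify` and `budget` are hypothetical names.

```python
import numpy as np

def sparsify(g, budget, rng):
    """Keep coordinate i with probability p_i and rescale it by 1 / p_i,
    so that E[g_hat | g] = g. Assumes g is a nonzero float array."""
    p = np.minimum(1.0, budget * np.abs(g) / np.abs(g).sum())
    keep = rng.random(g.shape) < p        # zero entries have p = 0 and are never kept
    g_hat = np.zeros_like(g)
    g_hat[keep] = g[keep] / p[keep]
    return g_hat

rng = np.random.default_rng(0)
g = np.array([3.0, -1.0, 0.5, 0.0, 2.0])
samples = np.array([sparsify(g, budget=2, rng=rng) for _ in range(20000)])
print("mean of g_hat:", samples.mean(axis=0))
print("variance of g_hat given g:", samples.var(axis=0))
```

The printed variance is conditional on a fixed g. The reviewer's point is that general SGD convergence results need the unconditional variance, which also includes the randomness of g itself.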


Reviews: The promises and pitfalls of Stochastic Gradient Langevin Dynamics

Neural Information Processing Systems

Review after rebuttal: I thank the author(s) for their response. While I still believe that this paper is a minor increment beyond what has already been done on SGLD, I agree that the message might be useful for some. I also appreciate the effort the authors have made in improving the manuscript based on the reviewers' suggestions, particularly their efforts to include numerical experiments relevant to ML scenarios, and recommendations beyond the CV approach, which has been studied to exhaustion and is rarely applicable in practice. Based on this, I've adjusted my decision to marginally above threshold. Original review: In the paper "The promises and pitfalls of Stochastic Gradient Langevin Dynamics" the authors revisit the Stochastic Gradient Langevin Dynamics (SGLD) approach to approximately sampling from a probability distribution using stochastic gradients (specifically subsampling). The authors compare a number of different classes of approximate inference methods, including SGLD, LMC (known by some as the Unadjusted Langevin Algorithm or ULA) and Stochastic Gradient Langevin Dynamics Fixed Point (SGLDFP), the latter being a variant of SGLD with a control variate exploiting the unimodality of the distribution, similar to what has been presented in [3, 25 and others].
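The difference between the plain subsampled gradient and the control-variate version can be sketched on a toy model. The model (Gaussian mean with unit variance and a flat prior), the step size, and all function names are assumptions chosen for illustration, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 1000, 10                       # data set size and minibatch size
data = rng.normal(loc=2.0, size=N)    # synthetic observations (assumption)
theta_star = data.mean()              # posterior mode for this toy model

def full_grad(theta):
    """Gradient of the negative log posterior U(theta) = sum_i (theta - x_i)^2 / 2."""
    return np.sum(theta - data)

def sgld_grad(theta, idx):
    """Plain subsampled estimate used by SGLD."""
    return (N / n) * np.sum(theta - data[idx])

def sgldfp_grad(theta, idx):
    """Control-variate estimate centred at the mode, in the style of SGLDFP."""
    correction = (theta - data[idx]) - (theta_star - data[idx])
    return full_grad(theta_star) + (N / n) * np.sum(correction)

def run(grad_est, steps=2000, gamma=1e-4):
    theta, out = theta_star, np.empty(steps)
    for k in range(steps):
        idx = rng.choice(N, size=n, replace=False)
        theta = theta - gamma * grad_est(theta, idx) + np.sqrt(2 * gamma) * rng.normal()
        out[k] = theta
    return out

print("posterior variance:", 1.0 / N)
print("SGLD sample variance:", run(sgld_grad).var())
print("SGLDFP sample variance:", run(sgldfp_grad).var())
```

In this particular model the control variate removes the subsampling noise entirely, so the toy overstates the benefit relative to a general posterior.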


Reviews: Gradient Descent for Spiking Neural Networks

Neural Information Processing Systems

This paper introduces a smooth thresholding technique which enables practically standard gradient descent optimization to be applied to spiking neural networks. Since the spiking threshold is usually set at a certain membrane potential, the function "spike or no spike" is a function of voltage whose distributional derivative is a Dirac delta at the threshold. By replacing this Dirac delta with a finite positive function g(v) that has tight support around the threshold and integrates to 1, the step function "spike or no spike" is replaced by a function that increases continuously from 0 to 1 across the support of g. In turn, this setup can be placed into standard differential equation models governing spikes, while retaining the possibility of a meaningful gradient signal for parameter optimization. Two experiments are evaluated, an autoencoding task and a delayed-memory-XOR task, both of which are shown to be trainable with the proposed setup.
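A minimal sketch of such a replacement is given below. The raised-cosine shape, the threshold and width values, and the names `gate_density` and `soft_gate` are assumptions; the review only requires g to be positive, tightly supported around the threshold, and to integrate to 1.

```python
import numpy as np

def _scaled(v, threshold, width):
    """Distance from the threshold in units of the half-width, clipped to [-1, 1]."""
    return np.clip((np.asarray(v, dtype=float) - threshold) / width, -1.0, 1.0)

def gate_density(v, threshold=1.0, width=0.1):
    """g(v): raised-cosine bump on [threshold - width, threshold + width]."""
    u = _scaled(v, threshold, width)
    return (1.0 + np.cos(np.pi * u)) / (2.0 * width)

def soft_gate(v, threshold=1.0, width=0.1):
    """Integral of g up to v: a smooth step that rises from 0 to 1 across the support."""
    u = _scaled(v, threshold, width)
    return 0.5 + 0.5 * u + np.sin(np.pi * u) / (2.0 * np.pi)

v = np.linspace(0.8, 1.2, 9)
print(np.round(soft_gate(v), 3))
print(np.round(gate_density(v), 3))
```

Because `gate_density` is the exact derivative of `soft_gate`, a gradient-based optimizer sees a nonzero signal whenever the voltage is within `width` of the threshold and zero elsewhere.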


Reviews: The Limit Points of (Optimistic) Gradient Descent in Min-Max Optimization

Neural Information Processing Systems

The main contribution of the paper can be summarized in two results (stated in the inclusion following line 83): (1) local saddles are stable for GDA (under Assumption 8.1), and (2) stable equilibria of GDA are also stable for OGDA. Quality: The results are interesting, and the paper is well written. There are some typos in the proofs, but I believe these are omissions that can be corrected, rather than major flaws. Significance: I would love to see further discussion of the consequences of this result, and its relevance to the NIPS community, both theoreticians and practitioners. For example, do these results suggest that GDA should be preferred to OGDA (since the latter has a larger equilibrium set)?
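The remark that OGDA has a larger set of stable points can be illustrated on the standard bilinear example f(x, y) = x * y, where x minimizes and y maximizes. This example, the step size, and the function names are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def grad(x, y):
    """Partial derivatives of f(x, y) = x * y with respect to x and y."""
    return y, x

def gda(x, y, eta, steps):
    """Simultaneous gradient descent (on x) / ascent (on y)."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y + eta * gy
    return x, y

def ogda(x, y, eta, steps):
    """Optimistic variant: use twice the current gradient minus the previous one."""
    gx_prev, gy_prev = grad(x, y)
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - 2 * eta * gx + eta * gx_prev, y + 2 * eta * gy - eta * gy_prev
        gx_prev, gy_prev = gx, gy
    return x, y

gda_norm = np.hypot(*gda(1.0, 1.0, eta=0.2, steps=1000))
ogda_norm = np.hypot(*ogda(1.0, 1.0, eta=0.2, steps=1000))
print("GDA distance from (0, 0):", gda_norm)
print("OGDA distance from (0, 0):", ogda_norm)
```

GDA spirals away from the stationary point (0, 0) while OGDA spirals in, so (0, 0) is an example of a point that is stable for OGDA but not for GDA.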


Reviews: On Markov Chain Gradient Descent

Neural Information Processing Systems

POST REBUTTAL: I do think that the edit to the proof suggested by the authors could work, but would lead to some exorbitant constant C4, a subject not addressed by the authors. Still, I have increased my score from "clear reject" to "accept" in the light of the fact that I am now happy with the validity of the proofs.


Reviews: Stochastic Nested Variance Reduced Gradient Descent for Nonconvex Optimization

Neural Information Processing Systems

The paper proposes a stochastic nested variance reduced gradient descent method for non-convex finite-sum optimization. Variance reduction in the stochastic gradient estimates is known to improve the gradient-evaluation complexity. A popular method is stochastic variance reduced gradient (SVRG), which uses a single reference point to evaluate the gradient. Inspired by this, the authors introduce variance reduction using multiple reference points in a nested scheme. More precisely, each reference point is updated every T steps, and the proposed algorithm uses K points, so one epoch iterates over T K loops.
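For orientation, the single-reference-point baseline (SVRG) can be sketched as below. It is shown on a least-squares problem with synthetic data, which is an assumption made for simplicity (the paper targets non-convex objectives), and it does not implement the nested multi-reference scheme. The names `vr_grad` and `svrg` are hypothetical.

```python
import numpy as np

def vr_grad(A, b, x, x_ref, full_ref, i):
    """Variance-reduced direction: grad_i(x) - grad_i(x_ref) + full gradient at x_ref."""
    gi = A[i] * (A[i] @ x - b[i])
    gi_ref = A[i] * (A[i] @ x_ref - b[i])
    return gi - gi_ref + full_ref

def svrg(A, b, step, epochs, inner, rng):
    """SVRG for f(x) = (1 / 2n) * sum_i (a_i . x - b_i)^2."""
    n, d = A.shape
    x_ref = np.zeros(d)
    for _ in range(epochs):
        full_ref = A.T @ (A @ x_ref - b) / n      # one full pass per epoch
        x = x_ref.copy()
        for _ in range(inner):
            i = rng.integers(n)
            x = x - step * vr_grad(A, b, x, x_ref, full_ref, i)
        x_ref = x                                 # move the reference point
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))                     # synthetic design matrix (assumption)
x_true = rng.normal(size=5)
b = A @ x_true
x_hat = svrg(A, b, step=0.01, epochs=30, inner=400, rng=rng)
print("error:", np.linalg.norm(x_hat - x_true))
```

The nested scheme described in the review replaces the single `x_ref` with several reference points that are refreshed at different frequencies.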