Goto

Collaborating Authors

 Gradient Descent


Reviews: Without-Replacement Sampling for Stochastic Gradient Methods

Neural Information Processing Systems

The paper studies the problem of minimizing the average of a finite sum of convex functions over a convex domain using stochastic algorithms that, opposed to most popular methods, apply WITHOUT-replacement sampling to the data. More specifically, the paper considers methods that first randomly permute the functions, and then process the functions via incremental updates one at a time, making at most a single pass over the data (hence the data is only shuffled once). There are three main motivations for considering stochastic methods with without-replacement sampling: 1. it is observed many times empirically (and to the best of my knowledge this is what ML practitioners really do) that applying random shuffles to the data and then apply incremental updates works better than with-replacement SGD. 2. After the data is randomly permuted, these algorithms require only sequential access to the memory, which is much more efficient than standard with-replacement SGD methods that require random access to perform the updates. Since the main setting under consideration here is when making at most a single pass, it sounds plausible to assume that the data is already stored in some random permutation, and hence the algorithm is fully sequential, and there is no need to even artificially permute the data. In a distributed setting in which data is partitioned to several machine (not assuming the data is sampled i.i.d from a distribution), it is more natural and efficient to analyze without-replacement algorithms.


Reviews: Fast Algorithms for Robust PCA via Gradient Descent

Neural Information Processing Systems

This is an interesting work that presents the perhaps first theoretical guarantee for a widely used optimization technique of the problem of robust PCA. I think it has potential and should be somewhere between poster and oral. During the rebuttal period, I would suggest the authors to address the following minor concerns. That been said, it is not clear how one is able to know the true rank "r" and the corruption fraction "alpha". The projection step (Step 7 and Step 8) requires a knowledge of the incoherence parameter.


Reviews: Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Neural Information Processing Systems

Overall, I found the paper interesting; the paper offers new theory as well as numerical results comparable to the state of the art on decently difficult datasets. Perhaps due to space constraints, an important part of the paper (section 3.2) - the inference algorithm - is poorly explained. In particular, I initially thought that the use of particles meant that the approximating distribution was a sum of Dirac delta functions - but that cannot be the case since, even with many particles, the'posterior' would degenerate into the MAP (note that in similar work, authors either use particles when p(x) involves discrete x variables, as in Kulkarni et al, or'smooth' the particles to approximate a continuous distribution, as in Gershman et al). Instead, it looks like the algorithm works directly on samples of the distribution q0, q1.. (hence the vague'for whatever distribution q that {xi}ni 1 currently represents'). It is tempting to consider q_i to be a kernel density estimate (mixture of normals with fixed width), and see if we can approximate equation 9 for that representation to be stable.


Reviews: Stochastic Gradient MCMC with Stale Gradients

Neural Information Processing Systems

Technical quality: I think that the theory is very complete (bounds are given for pretty much everything relevant to the problem), and the experiments show that this method performs better on large/complicated models (the small/simple models have too little variance for extra servers to help, and the staleness prevents much benefits). I think the biggest limitation of the paper is the lack of comparison against the method in [14] (the paper mostly compares against the non-distributed -- 1 worker -- case, instead of a more standard distributed case). Novelty/originality: My impression is theoretical results are mostly a combination of proof techniques used in other SG-MCMC and asynchronous SGD papers (however, I'm not too sure that this claim is correct). Assuming this is true, I think the results are well-executed, but not too unique. Potential impact or usefulness: I think the theoretical analysis will be useful for people interested in how asynchrony affects SG-MCMC. However, I'm not too clear how much this will help for running SG-MCMC in practice.


Reviews: Variance Reduction in Stochastic Gradient Langevin Dynamics

Neural Information Processing Systems

I have one key concern, which may be a misunderstanding on my part as I did not check the supplementary section in detail. The update at each time step is computed using gradients of different parameter values (theta), some of which were generated arbitrarily many time steps ago. This dependence on previous samples means that the SAGA-LD chain is not Markov. The proofs seem to be based on a result for SG-MCMC chains, but I am not sure if the result easily applies to SAGA-LD because of the violation of the Markov property. Other than the above point, I think this is a very useful line of work.


Reviews: A Multi-Batch L-BFGS Method for Machine Learning

Neural Information Processing Systems

In supervised learning, one is interested in minimizing the empirical risk where efficient optimization algorithms become the key. First-order methods such as stochastic gradient descent and its variants are reasonably well understood admitting efficient implementation and parallelization techniques. However there has been a recent interest in making second-order methods such as Newton's method or L-BFGS method efficient for such large-scale problems. This paper is along this direction, presenting a new variant of the stochastic L-BFGS method that is efficient and robust in mainly two settings: The first arises in the presence of node failures in a distributed computing environment, the second occurs when one uses an adaptive batch size that varies over iterations for accelerating learning. The main idea is to form the Hessian estimate based on the overlap between consecutive batches (the intuition why this works is that we have less limitation in choosing the second-order information matrix compared to an estimate of the true gradient).


Reviews: Reshaped Wirtinger Flow for Solving Quadratic System of Equations

Neural Information Processing Systems

Comparing with previous work, including generalized phase retrieval problems, this paper has the following differences: 1) Solves a second order function incorporating absolute values of measurements 2) No step size normalization (or variants) 3) No gradient truncation The key intuition of this paper is that for a measurement ai, x with large magnitude, in the local region near global optima, the sign of ai, x and ai, z are likely to be same. In the case, locally the gradient update of reshaped WF (RWF) is as the same as an equivalent least squares problem. For the objective function including all the measurements, if most the components have reasonably large measurements, those with small measurements contribute less, hence using gradient descent good initialization should give good results. Such intuition is formalized in equation (34) and (35). Yet I have a question -- why don't one consider further truncating components with small magnitude when computing gradient? Just like in robust regression, TWF, etc, we can throw out "wrong" directions.


Reviews: Stochastic Structured Prediction under Bandit Feedback

Neural Information Processing Systems

Summary: This paper proposes a stochastic online learning method for the task of structured prediction. In this setting, the learner doest not get the correct structured output during training. Instead, it only gets bandit feedback from the labeler. The paper first proposes an online learning algorithm that learns model parameters via stochastic gradient descent; generalizes the learning method to pair-wise comparison of structured outputs; provides an optimization approach with Cross-Entropy Minimization; and theoretically analyzes the convergence property of the optimization approach. Pros: The paper proposes an online stochastic learning algorithm for minimizing the expected loss of structured predictions; gives a method of learning from pair-wise comparisons; and theoretical analyze the convergence rate.


Reviews: Stochastic Gradient Geodesic MCMC Methods

Neural Information Processing Systems

The extension of SGGMC from previous work (SGRHMC)[1] are in two folds. First, the proposed method use Geodesic flow rather than Riemmannian manifold. Second, the proposed method leverage a symmetric splitting integrator (ABOBA) scheme. However, unfortunately none of extensions have a clear and convincing novelty as far as I can see. The drop-in replacement of D and Q are not surprising.


Reviews: Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters

Neural Information Processing Systems

The initial motivation seems to be the work of Hoffman et al on the use of clustering to speedup stochastic methods for ERM. Their method was not proved to converge to the optimal due to the use of biased stochastic gradients. Also, that work seemed to work only for small clusters due to the approach chosen. This papers goes a long way to develop the basic idea into a satisfying theoretical framework which also gives rise to efficient implementations. This paper is truly a pleasure to read – a very fine example of academic exposition.