Gradient Descent
Reviews: Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent
This paper presents a formal condition for Byzantine fault-tolerant stochastic gradient aggregation, as well as an algorithm (Krum) that satisfies the condition. Given the definition, it is straightforward to see that aggregation based on linear combination (including averaging) is not Byzantine tolerant. Meanwhile the paper gives evidence that Krum is not only tolerant in theory but reasonably so in practice as well. I found the paper to be clear, well-organized, self-contained, and altogether fairly thorough. The basic motivation is clear: the literature on distributed learning has focused on statistical assumptions and well-behaved actors, but what can we do in very pessimistic conditions?
Reviews: Efficient Stochastic Gradient Hard Thresholding
The article analyses convergence in hard-thresholding algorithms and proposes an accelerated stochastic hybrid hard thresholding method that displays better convergence with respect to the compared methods. The article is dense but relatively fine to follow. Theoretical development seems to be complete and accurate, though I admit I have not throughly followed the full derivation. Experimental section is in accordance with the theoretical claims and is more than sufficient. Just for the sake of reproducibility of the results an exhaustive pseudocode or repository should be made available as a companion to the article to further strength the autor's points.
Reviews: A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization
This paper focuses on the optimization problem min f(x) h(x), where f is of a finite sum structure (with n functions in the sum), with nonconvex but smooth components, and h is a convex but possibly nonsmooth function. So, this is a nonconvex finite sum problem with a convex regularizer. Function h is treated using a prox step. The authors propose a small modification to ProxSVRG (called ProxSVRG), and prove that this small modification has surprisingly interesting consequences. The modification consists in replacing the full gradient computation in the outer loop of ProxSVRG by an approximation thereof through subsampling/minibatch (batch size B).
Reviews: Stochastic Spectral and Conjugate Descent Methods
This paper focuses on quadratic minimization and proposes a novel stochastic gradient descent algorithm by interpolating randomized coordinate descent (RCD) and stochastic spectral descent (SSD). Those two algorithms were studied in prior but they have drawbacks to apply for practical usage. Hence, authors combine them by random coin-tossing (i.e., with probability \alpha run RCD and with probability 1-\alpha run SSD) and provide the optimal parameters to achieve the best convergence rate. In addition, they study the rate of their algorithm for some special cases, e.g., a matrix in objective has clustered or exponentially decaying eigenvalues. In experiments, they report practical behavior of their algorithm under synthetic settings. The proposed algorithm takes advantages of two prior works, i.e., efficient gradient update from RCD and fast convergence from SSD.
Reviews: Stochastic Cubic Regularization for Fast Nonconvex Optimization
This submission is interested in stochastic nonconvex optimization problems, in which only stochastic estimates of the objective and its derivatives can be accessed at every iteration. The authors develop a variant of the cubic regularization framework, that only requires access to stochastic gradients and products of stochastic Hessians with vectors. Such a method is shown to reach a point at which the gradient norm is smaller than \epsilon and the minimum Hessian eigenvalue is bigger than -\sqrt{\rho \epsilon} in a number of stochastic queries (gradient or Hessian-vector product) of order of \epsilon {-3.5}, which improves over the classical complexity of Stochastic Gradient Descent (SGD). The problem of interest is clearly introduced, along with the possible advantages of using both stochastic estimates and a cubic regularization framework. The associated literature is correctly reviewed, and the authors even cite contemporary work that achieves similar complexity guarantees but rely on stochastic gradient estimates and variance reduction.
Reviews: Convergence Analysis of Two-layer Neural Networks with ReLU Activation
This paper proves the convergence of a stochastic gradient descent algorithm from a suitable starting point to the global minimizer of a nonconvex energy representing the loss of a two-layer feedforward network with rectified linear unit activation. In particular, the algorithm is shown to converge in two phases, where phase 1 drives the iterates into a one-point convex region which subsequently leads to the actual convergence in phase 2. The findings, the analysis, and particularly the methodology for proving the convergence (in 2 phases) are very interesting and definitely deserve to be published. The entire proof is extremely long (including a flowchart of 15 Lemmas/Theorems that finally allow to show the main theorem in 25 pages of proofs), and I have to admit that I did not check this part. I have some questions on the paper and suggestions to further improve the manuscript: - Figure 1 points out the different structure of the considered networks, and even the abstract already refers to the special structure of "identity mappings". However, optimizing for W (vanilla network) is equivalent to optimizing for (W I), such that the difference seems to lie in different starting points only.
Reviews: Variance-Reduced Stochastic Gradient Descent on Streaming Data
This paper considers the problem of streaming stochastic optimization taking into account the arrival patterns of examples in time. Whereas the relevant previous work focuses on learning from a stream (i.e. in one pass over the data, with O(dimension) memory), this work attempts to utilize the time spent between the arrival of examples in order to revisit previously encountered examples. The problem description is novel and appealing, both in its possible practical relevance and as an interesting new model for consideration in algorithm analysis. However, there are central issues with the comparisons that the paper draws between algorithms, both conceptually and experimentally. Conceptually, it seems incorrect to say that STRSAGA is a streaming algorithm, and in turn in to repeatedly compare it to one (such as SSVRG), since STRSAGA's memory complexity actually grows linearly with the dataset size (in order to maintain its "effective sample set").
Reviews: Scalable Planning with Tensorflow for Hybrid Nonlinear Domains
This paper documents an approach to planning in domains with hybrid state and action spaces, using efficient stochastic gradient descent methods, in particular, in this case, as implemented in TensorFlow. Certainly, the idea of optimizing plans or trajectories using gradient methods is not new (lots of literature on shooting methods, using fmincon, etc. exists). And, people have long understood that random restarts for gradient methods in non-linear problems are a way to mitigate problems with local optimal. What this paper brings is (1) the idea of running those random restarts in parallel and (b) using an existing very efficient implementation of SGD. I'm somewhat torn, because it seems like a good idea, but also not very surprising.
Reviews: Sample Efficient Stochastic Gradient Iterative Hard Thresholding Method for Stochastic Sparse Linear Regression with Limited Attribute Observation
This looks an interesting result. Authors suggest a couple of methods, inspired by "mini batch strategy", for solving linear regression when the signal of interest is sparse and one can observe the data partially. The problem and their proposed methods are clearly described in the paper. Theoretical guarantees for convergence of the algorithms are provided which is supported by practical evidence given in the paper. As the authors suggest, these methods look to outperform other existing methods in term of "sample complexity".