AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Reviews: Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent

Neural Information Processing SystemsOct-8-2024, 13:08:53 GMT

This paper presents a formal condition for Byzantine fault-tolerant stochastic gradient aggregation, as well as an algorithm (Krum) that satisfies the condition. Given the definition, it is straightforward to see that aggregation based on linear combination (including averaging) is not Byzantine tolerant. Meanwhile the paper gives evidence that Krum is not only tolerant in theory but reasonably so in practice as well. I found the paper to be clear, well-organized, self-contained, and altogether fairly thorough. The basic motivation is clear: the literature on distributed learning has focused on statistical assumptions and well-behaved actors, but what can we do in very pessimistic conditions?

byzantine tolerant gradient descent, experiment, machine learning, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.73)

Add feedback

Reviews: Learning ReLUs via Gradient Descent

Neural Information Processing SystemsOct-8-2024, 11:12:40 GMT

The analysis assumes the input samples are iid Gaussian and the output is realizable, i.e. the target value y_k for input x_k is constructed via y_k ReLU(w *.x_k). The paper studies a regression setting and uses least squares as the loss function.

gradient descent, learning relus, review, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.45)

Add feedback

Reviews: Efficient Stochastic Gradient Hard Thresholding

Neural Information Processing SystemsOct-8-2024, 08:47:48 GMT

The article analyses convergence in hard-thresholding algorithms and proposes an accelerated stochastic hybrid hard thresholding method that displays better convergence with respect to the compared methods. The article is dense but relatively fine to follow. Theoretical development seems to be complete and accurate, though I admit I have not throughly followed the full derivation. Experimental section is in accordance with the theoretical claims and is more than sufficient. Just for the sake of reproducibility of the results an exhaustive pseudocode or repository should be made available as a companion to the article to further strength the autor's points.

convergence, efficient stochastic gradient hard thresholding, review

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Add feedback

Reviews: A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization

Neural Information Processing SystemsOct-8-2024, 08:12:46 GMT

This paper focuses on the optimization problem min f(x) h(x), where f is of a finite sum structure (with n functions in the sum), with nonconvex but smooth components, and h is a convex but possibly nonsmooth function. So, this is a nonconvex finite sum problem with a convex regularizer. Function h is treated using a prox step. The authors propose a small modification to ProxSVRG (called ProxSVRG), and prove that this small modification has surprisingly interesting consequences. The modification consists in replacing the full gradient computation in the outer loop of ProxSVRG by an approximation thereof through subsampling/minibatch (batch size B).

artificial intelligence, machine learning, proxsvrg, (14 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.42)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.42)

Add feedback

Reviews: Stochastic Spectral and Conjugate Descent Methods

Neural Information Processing SystemsOct-8-2024, 08:12:24 GMT

This paper focuses on quadratic minimization and proposes a novel stochastic gradient descent algorithm by interpolating randomized coordinate descent (RCD) and stochastic spectral descent (SSD). Those two algorithms were studied in prior but they have drawbacks to apply for practical usage. Hence, authors combine them by random coin-tossing (i.e., with probability \alpha run RCD and with probability 1-\alpha run SSD) and provide the optimal parameters to achieve the best convergence rate. In addition, they study the rate of their algorithm for some special cases, e.g., a matrix in objective has clustered or exponentially decaying eigenvalues. In experiments, they report practical behavior of their algorithm under synthetic settings. The proposed algorithm takes advantages of two prior works, i.e., efficient gradient update from RCD and fast convergence from SSD.

algorithm, spectral and conjugate descent method, stochastic spectral, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.98)

Add feedback

Reviews: Stochastic Cubic Regularization for Fast Nonconvex Optimization

Neural Information Processing SystemsOct-8-2024, 06:22:10 GMT

This submission is interested in stochastic nonconvex optimization problems, in which only stochastic estimates of the objective and its derivatives can be accessed at every iteration. The authors develop a variant of the cubic regularization framework, that only requires access to stochastic gradients and products of stochastic Hessians with vectors. Such a method is shown to reach a point at which the gradient norm is smaller than \epsilon and the minimum Hessian eigenvalue is bigger than -\sqrt{\rho \epsilon} in a number of stochastic queries (gradient or Hessian-vector product) of order of \epsilon {-3.5}, which improves over the classical complexity of Stochastic Gradient Descent (SGD). The problem of interest is clearly introduced, along with the possible advantages of using both stochastic estimates and a cubic regularization framework. The associated literature is correctly reviewed, and the authors even cite contemporary work that achieves similar complexity guarantees but rely on stochastic gradient estimates and variance reduction.

fast nonconvex optimization, stochastic cubic regularization, submission, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.99)

Add feedback

Reviews: Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Neural Information Processing SystemsOct-8-2024, 06:13:08 GMT

This paper proves the convergence of a stochastic gradient descent algorithm from a suitable starting point to the global minimizer of a nonconvex energy representing the loss of a two-layer feedforward network with rectified linear unit activation. In particular, the algorithm is shown to converge in two phases, where phase 1 drives the iterates into a one-point convex region which subsequently leads to the actual convergence in phase 2. The findings, the analysis, and particularly the methodology for proving the convergence (in 2 phases) are very interesting and definitely deserve to be published. The entire proof is extremely long (including a flowchart of 15 Lemmas/Theorems that finally allow to show the main theorem in 25 pages of proofs), and I have to admit that I did not check this part. I have some questions on the paper and suggestions to further improve the manuscript: - Figure 1 points out the different structure of the considered networks, and even the abstract already refers to the special structure of "identity mappings". However, optimizing for W (vanilla network) is equivalent to optimizing for (W I), such that the difference seems to lie in different starting points only.

convergence analysis, relu activation, two-layer neural network, (9 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.57)

Add feedback

Reviews: Variance-Reduced Stochastic Gradient Descent on Streaming Data

Neural Information Processing SystemsOct-8-2024, 04:33:32 GMT

This paper considers the problem of streaming stochastic optimization taking into account the arrival patterns of examples in time. Whereas the relevant previous work focuses on learning from a stream (i.e. in one pass over the data, with O(dimension) memory), this work attempts to utilize the time spent between the arrival of examples in order to revisit previously encountered examples. The problem description is novel and appealing, both in its possible practical relevance and as an interesting new model for consideration in algorithm analysis. However, there are central issues with the comparisons that the paper draws between algorithms, both conceptually and experimentally. Conceptually, it seems incorrect to say that STRSAGA is a streaming algorithm, and in turn in to repeatedly compare it to one (such as SSVRG), since STRSAGA's memory complexity actually grows linearly with the dataset size (in order to maintain its "effective sample set").

algorithm, strsaga, variance-reduced stochastic gradient descent, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

Add feedback

Reviews: Scalable Planning with Tensorflow for Hybrid Nonlinear Domains

Neural Information Processing SystemsOct-8-2024, 04:32:59 GMT

This paper documents an approach to planning in domains with hybrid state and action spaces, using efficient stochastic gradient descent methods, in particular, in this case, as implemented in TensorFlow. Certainly, the idea of optimizing plans or trajectories using gradient methods is not new (lots of literature on shooting methods, using fmincon, etc. exists). And, people have long understood that random restarts for gradient methods in non-linear problems are a way to mitigate problems with local optimal. What this paper brings is (1) the idea of running those random restarts in parallel and (b) using an existing very efficient implementation of SGD. I'm somewhat torn, because it seems like a good idea, but also not very surprising.

hybrid nonlinear domain, scalable planning, tensorflow, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.57)

Add feedback

Reviews: Sample Efficient Stochastic Gradient Iterative Hard Thresholding Method for Stochastic Sparse Linear Regression with Limited Attribute Observation

Neural Information Processing SystemsOct-8-2024, 03:07:25 GMT

This looks an interesting result. Authors suggest a couple of methods, inspired by "mini batch strategy", for solving linear regression when the signal of interest is sparse and one can observe the data partially. The problem and their proposed methods are clearly described in the paper. Theoretical guarantees for convergence of the algorithms are provided which is supported by practical evidence given in the paper. As the authors suggest, these methods look to outperform other existing methods in term of "sample complexity".

complexity, gradient iterative hard thresholding method, sample complexity, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Add feedback