Goto

Collaborating Authors

 Gradient Descent



Reviews: Fast and Accurate Stochastic Gradient Estimation

Neural Information Processing Systems

Summary: This paper develops a new method for adaptively sampling training examples during stochastic optimization. It is known that the optimal distribution that minimizes the nuclear norm of the covariance of the gradient estimate is one where the the probability of sampling an example is proportional to the magnitude of the gradient of the loss on that example. Sampling according to this distribution is of course impractical, because computing this distribution is as expensive as computing the full gradient and requires O(N) time per iteration. To get around this, prior work either maintains a fixed distribution across all iterations or makes strong assumptions on the distribution of gradients of different training examples (e.g.: the gradients of training examples of the same class are similar). This paper proposes a method that can adaptively sample from different distributions every iteration and requires little assumptions on the distribution of gradients, and yet requires the same per-iteration cost as SGD.


Reviews: Fast and Accurate Stochastic Gradient Estimation

Neural Information Processing Systems

This paper received extensive discussion by the reviewers, the meta-reviewer, the SPC, etc. Here is a meta-review summary. The paper considers the problem of adaptively sampling training examples in stochastic optimization, and it shows that it is possible to do so without a per-iteration cost of O(N). This is of interest by itself, since one typically thinks that such sampling requires maintaining a distribution over training examples, which requires O(N) in every iteration, i.e., which is as expensive as full-batch gradient descent. A second aspect of this paper is that the mechanism by which the authors accomplish this is to use LSH, which is a sketching method usually used for nearest neighbor search.


Review for NeurIPS paper: Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks

Neural Information Processing Systems

Additional Feedback: Line 1: Change to "Natural Gradient Descent..." Line 10, 11: "the function space" should just be "function space" Line 15: it might be worth pointing out here and/or in the intro that a special kind of data preprocessing (the "Forster transform") is required to get this result for K-FAC in general Line 16, 46: "under some assumptions"/"under specific conditions" should perhaps be replaced with "under some approximating assumptions". AFAIK the "gradient independence assumption" doesn't have any rigorous justification and might not even be true in practice. Line 69: "New insights and perspectives on the natural gradient method" also argues that the empirical Fisher is a poor substitute for the "true" one. Line 71: first quotation make is backwards Line 79: delete "firing" here Line 88: "We normalize each sample by" should be "We normalize each sample so that" Line 90: "we overview" should be "we give an overview of" Line 116: Although the use of damping in the context of NTK theory can be explained this way, damping has a larger role in second order optimization in general (where NTK theory doesn't necessarily apply). The way you are describing it though, it sounds like you are saying its use is fully explained by this theory, and I would suggest you change this.


Review for NeurIPS paper: Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks

Neural Information Processing Systems

This is a compelling paper which covers a lot of ground while keeping the presentation accessible and engaging for the reader. Interestingly, it finds that the K-FAC approximations match the exact NGD trajectory in function space but not weight space. The paper answers quite a lot of questions which are natural to ask, and (having worked a lot in this area) I found the answers interesting and novel. The reviewers seem to have checked it over pretty carefully and didn't spot any problems. The paper is well written, and the authors have clearly paid a lot of attention to the presentation of the ideas.


Reviews: Continuous-time Models for Stochastic Optimization Algorithms

Neural Information Processing Systems

The paper presents an SDE approximation of mini-batch stochastic gradient descent and stochastic variance reduction gradient descent, two widely used methods, and they derive convergence rates. It presents a nice (i.e., not revolutionary, but still of interest to the community) result that fits within this area. Reviewers have a few suggestions for clarifications/improvements.


Constrained Hybrid Metaheuristic Algorithm for Probabilistic Neural Networks Learning

arXiv.org Artificial Intelligence

This study investigates the potential of hybrid metaheuristic algorithms to enhance the training of Probabilistic Neural Networks (PNNs) by leveraging the complementary strengths of multiple optimisation strategies. Traditional learning methods, such as gradient-based approaches, often struggle to optimise high-dimensional and uncertain environments, while single-method metaheuristics may fail to exploit the solution space fully. To address these challenges, we propose the constrained Hybrid Metaheuristic (cHM) algorithm, a novel approach that combines multiple population-based optimisation techniques into a unified framework. The proposed procedure operates in two phases: an initial probing phase evaluates multiple metaheuristics to identify the best-performing one based on the error rate, followed by a fitting phase where the selected metaheuristic refines the PNN to achieve optimal smoothing parameters. This iterative process ensures efficient exploration and convergence, enhancing the network's generalisation and classification accuracy. cHM integrates several popular metaheuristics, such as BAT, Simulated Annealing, Flower Pollination Algorithm, Bacterial Foraging Optimization, and Particle Swarm Optimisation as internal optimisers. To evaluate cHM performance, experiments were conducted on 16 datasets with varying characteristics, including binary and multiclass classification tasks, balanced and imbalanced class distributions, and diverse feature dimensions. The results demonstrate that cHM effectively combines the strengths of individual metaheuristics, leading to faster convergence and more robust learning. By optimising the smoothing parameters of PNNs, the proposed method enhances classification performance across diverse datasets, proving its application flexibility and efficiency.


Scaling of hardware-compatible perturbative training algorithms

arXiv.org Artificial Intelligence

In this work, we explore the capabilities of multiplexed gradient descent (MGD), a scalable and efficient perturbative zeroth-order training method for estimating the gradient of a loss function in hardware and training it via stochastic gradient descent. We extend the framework to include both weight and node perturbation, and discuss the advantages and disadvantages of each approach. We investigate the time to train networks using MGD as a function of network size and task complexity. Previous research has suggested that perturbative training methods do not scale well to large problems, since in these methods the time to estimate the gradient scales linearly with the number of network parameters. However, in this work we show that the time to reach a target accuracy--that is, actually solve the problem of interest--does not follow this undesirable linear scaling, and in fact often decreases with network size. Furthermore, we demonstrate that MGD can be used to calculate a drop-in replacement for the gradient in stochastic gradient descent, and therefore optimization accelerators such as momentum can be used alongside MGD, ensuring compatibility with existing machine learning practices. Our results indicate that MGD can efficiently train large networks on hardware, achieving accuracy comparable to backpropagation, thus presenting a practical solution for future neuromorphic computing systems.


Review for NeurIPS paper: Generalization Bound of Gradient Descent for Non-Convex Metric Learning

Neural Information Processing Systems

The SMILE algorithm is basically the Nadaraya-Watson estimator (with the Gaussian kernel with the Mahalanobis metric instead of the Euclidean metric) where the support vectors are also learnt instead of using training points as support vectors. It is not clear how much advantage does the SMILE classifier get by learning the representative instances. I suspect that they may account for a lot of the advantage SMILE has over other algorithms, since the SMILE classifier is the very simple Nadaraya-Watson estimator, as pointed out above. Moreover, the other competitor algorithms were not offered a chance to similarly learn nice prototypes. One way to rebut this criticism would be to run SMILE but restrict it to using a subset of training points or else award all other methods e.g. It would have been nice if some contemporary applications with VAE or deep metric learning could have been explored.


Review for NeurIPS paper: Generalization Bound of Gradient Descent for Non-Convex Metric Learning

Neural Information Processing Systems

The four referees support acceptance for the contribution and I also recommend acceptance. However, please consider revising your paper in order to address the lack of novelty of Lemma 1, to add the generalization bound of SMILE and to precise the fact that your result is rather general and not purely specific to metric learning.