Goto

Collaborating Authors

 Gradient Descent


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The paper studies the interesting question on online (stochastic) gradient descent in the unconstrained setting (sometime referred to non-strongly convexity or without explicit regularization). In the earlier work [33], it was proved that the last iterate of stochastic gradient descent with least-square loss in the unconstrained setting actually converges with explicit convergence rate by appropriately choosing the step sizes (or by a stopping early rule), which, however, needs to know the smoothness of the regression function. The paper proposed a kernel-based stochastic gradient descent algorithm without the need to perform model selection, which requires the loss function and its gradient are both Lipschitz. The proposed algorithm is mainly motivated by the recent studies [15,16] which involves a data-dependent regularization.


Scalable Kernel Methods via Doubly Stochastic Gradients

Neural Information Processing Systems

The general perception is that kernel methods are not scalable, so neural nets become the choice for large-scale nonlinear learning problems. Have we tried hard enough for kernel methods? In this paper, we propose an approach that scales up kernel methods using a novel concept called doubly stochastic functional gradients''. Based on the fact that many kernel methods can be expressed as convex optimization problems, our approach solves the optimization problems by making two unbiased stochastic approximations to the functional gradient---one using random training points and another using random features associated with the kernel---and performing descent steps with this noisy functional gradient. Our algorithm is simple, need no commit to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting.



Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. Summary of paper and review: In this paper, the authors consider stochastic-gradient algorithms, and using an importance/weighted sampling scheme, they show how it is possible to attain faster convergence rates in certain regimes. In particular, for strongly convex problems, the authors show how--if one knows Lipschitz constants of every term in a finite sum objective--it is possible to attain convergence rates that depend not on a squared norm of Lipschitz constants but on a 1-norm-like quantity, which is always smaller. The downside of this approach is that one must know these Lipschitz constants, and it is difficult (perhaps impossible) to apply the results to objectives that are not of the from f(x) = \sum_{i=1}^n f_i(x). I am also not convinced that I should care to use these algorithms; the lack of empirical insights leaves me wondering if this analysis matters. Detailed comments: The idea here is a simple enough idea, and makes sense.



Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

"NIPS Neural Information Processing Systems 8-11th December 2014, Montreal, Canada",,, "Paper ID:","24" "Title:","Communication Efficient Distributed Machine Learning with the Parameter Server" Current Reviews First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. This paper presents improvements on a system for large-scale learning known as parameter server. The parameter server is designed to perform reliable distributed machine learning in large-scale industrial systems (1000's of nodes). The architecture is based on a bipartite graph composed by servers and workers. Workers compute gradients based on subsets of the training instances, while servers aggregate the workers' gradients, update the shared parameter vector and redistribute it to the workers for the next iteration.


Response to Reviewer 1: 3

Neural Information Processing Systems

We thank all reviewers for their comments and acknowledgeme nt of our contribution. Below we address each reviewer's comments separately. The reviewer raised a very good point. We will add this clarification in the revised version. Our gradient-based method is much more efficient but only finds a stationary point.


Flatness-Aware Stochastic Gradient Langevin Dynamics

arXiv.org Machine Learning

Generalization in deep learning is closely tied to the pursuit of flat minima in the loss landscape, yet classical Stochastic Gradient Langevin Dynamics (SGLD) offers no mechanism to bias its dynamics toward such low-curvature solutions. This work introduces Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), designed to efficiently and provably seek flat minima in high-dimensional nonconvex optimization problems. At each iteration, fSGLD uses the stochastic gradient evaluated at parameters perturbed by isotropic Gaussian noise, commonly referred to as Random Weight Perturbation (RWP), thereby optimizing a randomized-smoothing objective that implicitly captures curvature information. Leveraging these properties, we prove that the invariant measure of fSGLD stays close to a stationary measure concentrated on the global minimizers of a loss function regularized by the Hessian trace whenever the inverse temperature and the scale of random weight perturbation are properly coupled. This result provides a rigorous theoretical explanation for the benefits of random weight perturbation. In particular, we establish non-asymptotic convergence guarantees in Wasserstein distance with the best known rate and derive an excess-risk bound for the Hessian-trace regularized objective. Extensive experiments on noisy-label and large-scale vision tasks, in both training-from-scratch and fine-tuning settings, demonstrate that fSGLD achieves superior or comparable generalization and robustness to baseline algorithms while maintaining the computational cost of SGD, about half that of SAM. Hessian-spectrum analysis further confirms that fSGLD converges to significantly flatter minima.


Adaptive Kernel Selection for Stein Variational Gradient Descent

arXiv.org Machine Learning

A central challenge in Bayesian inference is efficiently approximating posterior distributions. Stein Variational Gradient Descent (SVGD) is a popular variational inference method which transports a set of particles to approximate a target distribution. The SVGD dynamics are governed by a reproducing kernel Hilbert space (RKHS) and are highly sensitive to the choice of the kernel function, which directly influences both convergence and approximation quality. The commonly used median heuristic offers a simple approach for setting kernel bandwidths but lacks flexibility and often performs poorly, particularly in high-dimensional settings. In this work, we propose an alternative strategy for adaptively choosing kernel parameters over an abstract family of kernels. Recent convergence analyses based on the kernelized Stein discrepancy (KSD) suggest that optimizing the kernel parameters by maximizing the KSD can improve performance. Building on this insight, we introduce Adaptive SVGD (Ad-SVGD), a method that alternates between updating the particles via SVGD and adaptively tuning kernel bandwidths through gradient ascent on the KSD. We provide a simplified theoretical analysis that extends existing results on minimizing the KSD for fixed kernels to our adaptive setting, showing convergence properties for the maximal KSD over our kernel class. Our empirical results further support this intuition: Ad-SVGD consistently outperforms standard heuristics in a variety of tasks.


Learning Low-Dimensional Embeddings for Black-Box Optimization

arXiv.org Artificial Intelligence

Black-box optimization (BBO) aims to find the solution of an optimization problem where the objective function is unknown or lacks an explicit mathematical formulation. Instead, the function can only be evaluated through queries, such as physical experiments or simulations through complex computational models. However, in many real-world scenarios, these evaluations are expensive, noisy, or time-consuming. Therefore, a key goal of black-box optimization is to find near-optimal solutions while limiting the number of costly function evaluations. Due to these challenges, BBO relies on sample-efficient strategies that select query points to balance exploration (searching unexplored regions) and exploitation (refining promising solutions).