Goto

Collaborating Authors

 Gradient Descent


Review for NeurIPS paper: On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems

Neural Information Processing Systems

Weaknesses: There are a lot similar results in slightly different regime, which makes this work looks incremental. In the case of GD, this Morse assumption can be resolved by using a stronger stable manifold theorem in "Michael Shub. I suspect a similar combination might go through here? Usually one view asymptotic results (this paper) weaker than non-asymptotic results (earlier papers), it is also not clear from this paper if one can obtain probability 1 result by modifying the existing high probability result with Borel Cantelli lemma and a bit extra work.


Review for NeurIPS paper: On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems

Neural Information Processing Systems

The paper has claimed three contributions, answering three questions. In general, I think removing boundedness assumption is useful. But Reviewer 3 pointed out that this work adds the "bounded level set" assumption and bounded gradient assumption, which do not seem much stronger than "bounded iterates". In particular, the gap between "bounded level-set" and "bounded iterates" is quite technical (not necessarily small, but not necessarily large), thus requires more explanation. R1 clarified that he was not claiming "[17, 18] or Pemantle alone already covered this paper's result", but asking whether a simple combination of [17, 18] and Pemantle [27] would give their result (that's why R1 used "morally" in the review).


Review for NeurIPS paper: Sinkhorn Barycenter via Functional Gradient Descent

Neural Information Processing Systems

Weaknesses: The constants in the bounds depend linearly on the dimension, although they depends exponentially on the regularization parameter. If Sinkhorn distance is thought as a proxy of the Wasserstein distance, this seems to be a hidden dependance on the dimension, since the regularization parameter plays the role of an interpolation between MMD and Wasserstein distances, and MMD distances are more blind to the dimension. This is not discussed in the paper. The results also have an exponential dependence on an assumed uniform upper bound on the cost. For the classical quadratic cost, this imply an exponential dependence on the dimension for the case of measures supported on [0,1] d for instance.


Review for NeurIPS paper: Sinkhorn Barycenter via Functional Gradient Descent

Neural Information Processing Systems

This paper proposes a new method to compute the (Sinkhorn) of barycenter of several probability measures. In practice, the method scales well computationally and in high-dimension and the authors provide some theoretical support. Reviewers agree that this paper is strong with only minor weaknesses (such as the exponential dependency in 1/gamma; which in related work is often suboptimal). I thus recommend accept (poster).


Benchmarking Randomized Optimization Algorithms on Binary, Permutation, and Combinatorial Problem Landscapes

arXiv.org Artificial Intelligence

In this paper, we evaluate the performance of four randomized optimization algorithms: Randomized Hill Climbing (RHC), Simulated Annealing (SA), Genetic Algorithms (GA), and MIMIC (Mutual Information Maximizing Input Clustering), across three distinct types of problems: binary, permutation, and combinatorial. We systematically compare these algorithms using a set of benchmark fitness functions that highlight the specific challenges and requirements of each problem category. Our study analyzes each algorithm's effectiveness based on key performance metrics, including solution quality, convergence speed, computational cost, and robustness. Results show that while MIMIC and GA excel in producing high-quality solutions for binary and combinatorial problems, their computational demands vary significantly. RHC and SA, while computationally less expensive, demonstrate limited performance in complex problem landscapes. The findings offer valuable insights into the trade-offs between different optimization strategies and provide practical guidance for selecting the appropriate algorithm based on the type of problems, accuracy requirements, and computational constraints.


MirrorCBO: A consensus-based optimization method in the spirit of mirror descent

arXiv.org Artificial Intelligence

In this work we propose MirrorCBO, a consensus-based optimization (CBO) method which generalizes standard CBO in the same way that mirror descent generalizes gradient descent. For this we apply the CBO methodology to a swarm of dual particles and retain the primal particle positions by applying the inverse of the mirror map, which we parametrize as the subdifferential of a strongly convex function $\phi$. In this way, we combine the advantages of a derivative-free non-convex optimization algorithm with those of mirror descent. As a special case, the method extends CBO to optimization problems with convex constraints. Assuming bounds on the Bregman distance associated to $\phi$, we provide asymptotic convergence results for MirrorCBO with explicit exponential rate. Another key contribution is an exploratory numerical study of this new algorithm across different application settings, focusing on (i) sparsity-inducing optimization, and (ii) constrained optimization, demonstrating the competitive performance of MirrorCBO. We observe empirically that the method can also be used for optimization on (non-convex) submanifolds of Euclidean space, can be adapted to mirrored versions of other recent CBO variants, and that it inherits from mirror descent the capability to select desirable minimizers, like sparse ones. We also include an overview of recent CBO approaches for constrained optimization and compare their performance to MirrorCBO.


Quantitative Error Bounds for Scaling Limits of Stochastic Iterative Algorithms

arXiv.org Machine Learning

Stochastic iterative algorithms, including stochastic gradient descent (SGD) and stochastic gradient Langevin dynamics (SGLD), are widely utilized for optimization and sampling in large-scale and high-dimensional problems in machine learning, statistics, and engineering. Numerous works have bounded the parameter error in, and characterized the uncertainty of, these approximations. One common approach has been to use scaling limit analyses to relate the distribution of algorithm sample paths to a continuous-time stochastic process approximation, particularly in asymptotic setups. Focusing on the univariate setting, in this paper, we build on previous work to derive non-asymptotic functional approximation error bounds between the algorithm sample paths and the Ornstein-Uhlenbeck approximation using an infinite-dimensional version of Stein's method of exchangeable pairs. We show that this bound implies weak convergence under modest additional assumptions and leads to a bound on the error of the variance of the iterate averages of the algorithm. Furthermore, we use our main result to construct error bounds in terms of two common metrics: the L\'{e}vy-Prokhorov and bounded Wasserstein distances. Our results provide a foundation for developing similar error bounds for the multivariate setting and for more sophisticated stochastic approximation algorithms.


Reviews: Optimal Learning for Multi-pass Stochastic Gradient Methods

Neural Information Processing Systems

This work provides a strong contribution in that it apparently is the first work to net optimal rates (up to log factors) for SGM, and moreover, it also handles a mini-batch analysis which includes the (full) batch method as a special case. Such rates previously had been established only for the (batch) ridge regression method. My interpretation of what all the results actually show is given in the Summary. I find the current solution of relying on cross-validation for adaptation to be a bit of an inelegant cop-out (even if there is a theoretically-supported method for using it); given that several of your corollaries provide a guarantee where \zeta and \gamma enter the picture only through T *, can you provide a self-monitoring method that decides when to stop? In particular, I find the most exciting results to be Corollaries 3.3 and 3.9, as only the stopping time depends on the (unknown) capacity parameters, and so such an online stopping mechanism might be possible.


Reviews: Learning to learn by gradient descent by gradient descent

Neural Information Processing Systems

Comments (mostly on related work): Authors: "The idea of using learning to learn or meta-learning to acquire knowledge or inductive biases has a long history [Thrun and Pratt, 1998]." But the intro to this reference is muddling the waters by confusing meta-learning (which is about learning the learning algorithm itself) and transfer learning, subsuming basically everything under "meta-learning," even standard back-propagation, because it can be applied to some data set, and then may learn new data points more quickly (so this is just standard transfer learning). To my knowledge, the first work on learning general learning algorithms written in a universal programming language was published in 1987: J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. Authors: "This work was built on by [Younger et al., 2001, Hochreiter et al., 2001] wherein a higher-level network act as a gradient descent procedure, with both levels trained during learning. Alternatively Schmidhuber [1992, 1993] considers networks that are able to modify their own behavior and act as an alternative to recurrent networks in meta-learning. Note, however that these earlier works do not directly address the transfer of a learned training procedure to novel problem instances and instead focus on adaptivity in the online setting."


Reviews: Barzilai-Borwein Step Size for Stochastic Gradient Descent

Neural Information Processing Systems

It allows using "Option I" (taking the final iterate of the inner iteration), as is done in practice. They also propose to use a scaled version of Barzilai-Borwein to set the step-sizse for SVRG (and heuristically argue that this could also be useful for classic stochastic gradient methods too). Their experiments show that this adaptive step-size is competitive with fixed step-sizes. Clarity: The paper is very clearly-written and easy to understand (though many grammar issues remain). Significance: Although several heuristic adaptive step-size strategies exist in the literature, this is the first theoretically-justified method. It sill depends on constants that we don't know in general, but I believe is a step towards black-box SG methods. Details: Independent of the SVRG/SG results, the authors give a nice way to bound the step-size for the BB method. Normally, BB leads to a much faster rate than using a constant step-size, but in the SVRG setting your theory/experiments are just showing that it does as well as the best step-size (which is good, but it isn't better than the best step size). Finally, the paper would be much stronger if it compared to the two existing strategies that are used in practice: 1.