Mathematical & Statistical Methods


Review for NeurIPS paper: Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm

Neural Information Processing Systems

This paper presents a new analysis of the proximal stochastic gradient Langevin algorithm. All the reviewers recognized its intellectual merits and raised only minor concerns. I suggest that the authors carefully revise their paper based on the reviewers' comments.


On the Exploration of Local Significant Differences For Two-Sample Test

Neural Information Processing Systems

Recent years have witnessed increasing attention to the two-sample test, with diverse real applications; this work takes a further step in the exploration of local significant differences for the two-sample test.


Random Quadratic Forms with Dependence: Applications to Restricted Isometry and Beyond

Neural Information Processing Systems

Several important families of computational and statistical results in machine learning and randomized algorithms rely on uniform bounds on quadratic forms of random vectors or matrices. Such results include the Johnson-Lindenstrauss (J-L) Lemma, the Restricted Isometry Property (RIP), randomized sketching algorithms, and approximate linear algebra. The existing results critically depend on statistical independence, e.g., independent entries for random vectors, independent rows for random matrices, etc., which prevents their use in dependent or adaptive modeling settings. In this paper, we show that such independence is in fact not needed: these results continue to hold under fairly general dependence structures. In particular, we present uniform bounds on random quadratic forms of stochastic processes which are conditionally independent and sub-Gaussian given another (latent) process. Our setup allows the stochastic process to depend in a general way on the history of the latent process, and allows the latent process to be influenced by realizations of the stochastic process. The results are thus applicable to adaptive modeling settings and also allow for the sequential design of random vectors and matrices. We also discuss stochastic-process-based forms of J-L, RIP, and sketching to illustrate the generality of the results.
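For reference (a standard statement, not quoted from the paper): the Restricted Isometry Property of order $s$ with constant $\delta_s \in (0,1)$ asks that a matrix $A \in \mathbb{R}^{m \times n}$ satisfy
\[
(1-\delta_s)\,\|x\|_2^2 \;\le\; \|Ax\|_2^2 \;\le\; (1+\delta_s)\,\|x\|_2^2 \quad \text{for all } s\text{-sparse } x \in \mathbb{R}^n ,
\]
and $\|Ax\|_2^2 = x^\top A^\top A x$ is exactly a quadratic form in the random matrix $A$; uniform control of such quadratic forms without independent rows is the type of guarantee the abstract describes.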


Reviews: Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates

Neural Information Processing Systems

UPDATE: I've read the other reviews and the rebuttal. I am keeping my score - this is a good paper. The study of Stochastic Gradient Descent in the overparametrized setting is a popular and important trend in the recent development of huge-scale optimization for deep learning. The authors propose a very basic and classical method, consisting of well-known algorithmic blocks (SGD with an Armijo-type line search), together with its first theoretical justification under the "interpolation assumption". The proof of convergence (for example, Theorem 2) mainly consists of standard arguments (those used in the proof of the classical non-stochastic Gradient Method with Lipschitz-continuous gradients).
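To make the reviewed method concrete, here is a minimal sketch of SGD with a stochastic Armijo-type backtracking line search, where the Armijo condition is checked on the same mini-batch used for the gradient; the function names, constants, and interface below are illustrative assumptions, not the authors' implementation.

import numpy as np

def sgd_armijo(loss_fn, grad_fn, w, batches, n_epochs=10, eta_max=1.0, c=0.1, beta=0.5):
    # Illustrative sketch: loss_fn(w, batch) and grad_fn(w, batch) evaluate the
    # mini-batch loss and gradient; both are assumed to be supplied by the caller.
    for _ in range(n_epochs):
        for batch in batches:
            g = grad_fn(w, batch)
            f0 = loss_fn(w, batch)
            eta = eta_max
            # Backtrack until the stochastic Armijo condition holds on this batch:
            #   f_i(w - eta * g) <= f_i(w) - c * eta * ||g||^2
            while loss_fn(w - eta * g, batch) > f0 - c * eta * np.dot(g, g):
                eta *= beta
            w = w - eta * g
    return w

Under the interpolation assumption, every mini-batch loss is minimized at the common solution, which is what makes a line search driven by a single mini-batch informative.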


Reviews: Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates

Neural Information Processing Systems

This paper brings a classic idea into the present and makes progress on a vexing problem with SGD: setting the step size. The authors provide theoretical as well as empirical evidence that their method is useful. The assumptions may be somewhat limiting; one version requires strong convexity, and when that is relaxed, other assumptions must be made. But this work points to a path that may be useful in the long run. An important way of contributing in ML is bridging fields; that could mean bringing in ideas that are state of the art in other fields, or it could mean revisiting classic ideas in new ways.


Bellman Equations for Markov Games

Neural Information Processing Systems

In this section, we present the Bellman equations for different types of values in Markov games. Recall the definition of CCE from our main paper (Eq. (4)); we restate it here after rescaling. First, a CCE always exists, since a Nash equilibrium for a general-sum game with payoff matrices (P, Q) is also a CCE defined by (P, Q), and a Nash equilibrium always exists. Third, a CCE in general-sum games need not be a Nash equilibrium. However, a CCE in zero-sum games is guaranteed to be a Nash equilibrium.
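Since the restated definition itself does not appear in this excerpt, here is the standard definition of a CCE for a two-player general-sum matrix game with payoff matrices (P, Q), stated in the usual scaling (which may differ from Eq. (4) by the rescaling mentioned above): a distribution $\sigma$ over joint actions $(a, b)$ is a CCE if
\[
\mathbb{E}_{(a,b)\sim\sigma}\big[P(a,b)\big] \;\ge\; \max_{a'} \, \mathbb{E}_{(a,b)\sim\sigma}\big[P(a',b)\big]
\quad\text{and}\quad
\mathbb{E}_{(a,b)\sim\sigma}\big[Q(a,b)\big] \;\ge\; \max_{b'} \, \mathbb{E}_{(a,b)\sim\sigma}\big[Q(a,b')\big],
\]
i.e., neither player can gain by deviating to a fixed action while the joint play is still drawn from $\sigma$; in the zero-sum case, $Q = -P$.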


SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence

Neural Information Processing Systems

Stein Variational Gradient Descent (SVGD), a popular sampling algorithm, is often described as the kernelized gradient flow for the Kullback-Leibler divergence in the geometry of optimal transport. We introduce a new perspective on SVGD that instead views SVGD as the (kernelized) gradient flow of the chi-squared divergence which, we show, exhibits a strong form of uniform exponential ergodicity under conditions as weak as a Poincaré inequality. This perspective leads us to propose an alternative to SVGD, called Laplacian Adjusted Wasserstein Gradient Descent (LAWGD), that can be implemented from the spectral decomposition of the Laplacian operator associated with the target density. We show that LAWGD exhibits strong convergence guarantees and good practical performance.
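As context for the abstract's description of SVGD as a kernelized gradient flow, the standard SVGD particle update (here with an RBF kernel) can be sketched as follows; this is a generic illustration under assumed interfaces, not the authors' code, and LAWGD itself additionally requires the spectral decomposition of a target-dependent operator.

import numpy as np

def svgd_step(x, grad_log_pi, step=0.1, bandwidth=1.0):
    # One standard SVGD update on particles x of shape (n, d); illustrative sketch.
    # grad_log_pi(x) is assumed to return the (n, d) array of scores grad log pi(x_j).
    n, _ = x.shape
    diffs = x[:, None, :] - x[None, :, :]            # diffs[i, j] = x_i - x_j
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * bandwidth ** 2))   # RBF kernel matrix
    scores = grad_log_pi(x)
    # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log pi(x_j) + grad_{x_j} k(x_j, x_i) ]
    repulsion = np.sum(diffs * K[:, :, None], axis=1) / bandwidth ** 2
    phi = (K @ scores + repulsion) / n
    return x + step * phi

# Example: sampling from a standard Gaussian, for which grad log pi(x) = -x.
particles = np.random.randn(100, 2)
for _ in range(500):
    particles = svgd_step(particles, lambda z: -z, step=0.05)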


Review for NeurIPS paper: SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence

Neural Information Processing Systems

Summary and Contributions: The paper makes the following contributions: 1) an interpretation (up to a constant factor of 2) of SVGD as the (kernelized) gradient flow of the chi-squared divergence, referred to as CSF; 2) establishing exponential ergodicity of CSF (in the continuous-time case) with respect to both the KL divergence and the chi-squared divergence, under a certain Poincaré condition (or LSI) on the target.

Indeed, this is an issue with any kernel method (from SVMs to MMD to SVGD), and it has been addressed in various ways. If one were critical, there is still no "nice" way to pick a kernel. Indeed, as mentioned in Lines 16 and 17, a single integral operator depending on the target \pi is good (in a way this is also along expected lines; for example, in the MMD context something similar leads to optimality properties). However, I tend not to agree 100% with Lines 27-28 that "solving high-dimensional PDEs is precisely the target of intensive research in modern numerical PDE", which is my main concern about the practical applicability of the proposed work. To the best of the reviewer's knowledge, there has been no "concrete" progress in this direction, despite several recent ad-hoc approaches.
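For reference (standard definitions, not quoted from the paper or the review), the chi-squared divergence and the Poincaré inequality invoked in contribution 2) are
\[
\chi^2(\mu \,\|\, \pi) = \int \Big(\frac{d\mu}{d\pi} - 1\Big)^2 d\pi ,
\qquad
\mathrm{Var}_\pi(f) \le C_P \int \|\nabla f\|^2 \, d\pi \ \ \text{for all smooth } f ,
\]
where $C_P$ is the Poincaré constant of the target $\pi$.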


Review for NeurIPS paper: SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence

Neural Information Processing Systems

The reviewers had some concerns about the empirical evaluation and the lack of discrete-time results, but agreed that this paper would be a useful addition to NeurIPS. Please see the reviews (and your response) for ways to improve the final manuscript (especially in terms of clarifying the context and scope of this work).