Mathematical & Statistical Methods


Globally Q-linear Gauss-Newton Method for Overparameterized Non-convex Matrix Sensing Defeng Sun

Neural Information Processing Systems

This paper focuses on the optimization of overparameterized, non-convex low-rank matrix sensing (LRMS)--an essential component in contemporary statistics and machine learning. Recent years have witnessed significant breakthroughs in first-order methods, such as gradient descent, for tackling this non-convex optimization problem. However, the presence of numerous saddle points often prolongs the time gradient descent needs to escape them.
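For context, the sketch below sets up the overparameterized matrix sensing problem the abstract refers to and runs plain gradient descent on the factorized least-squares loss. The Gaussian sensing operator, dimensions, step size, and initialization are illustrative assumptions, and this is the first-order baseline the abstract discusses, not the paper's Gauss-Newton method.

import numpy as np

rng = np.random.default_rng(0)
n, r_true, r_over, m = 20, 2, 5, 200   # ambient dim, true rank, overparameterized rank, #measurements

# ground-truth low-rank PSD matrix and random symmetric Gaussian sensing matrices A_i
U_star = rng.normal(size=(n, r_true))
X_star = U_star @ U_star.T
A = rng.normal(size=(m, n, n))
A = (A + A.transpose(0, 2, 1)) / 2            # symmetrize each sensing matrix
y = np.einsum('mij,ij->m', A, X_star)         # measurements y_i = <A_i, X*>

def loss_and_grad(U):
    # f(U) = (1/2m) sum_i (<A_i, U U^T> - y_i)^2 and its gradient w.r.t. U
    resid = np.einsum('mij,ij->m', A, U @ U.T) - y
    G = np.einsum('m,mij->ij', resid, A) / m  # (1/m) sum_i resid_i * A_i
    return 0.5 * resid @ resid / m, 2 * G @ U

U = 0.1 * rng.normal(size=(n, r_over))        # small random init, overparameterized width
for t in range(2000):
    f, g = loss_and_grad(U)
    U -= 0.005 * g                            # plain gradient descent step
print("final loss:", f, "relative recovery error:",
      np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star))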


An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods Tamer Başar Wotao Yin

Neural Information Processing Systems

In this paper, we revisit and improve the convergence of policy gradient (PG), natural PG (NPG) methods, and their variance-reduced variants, under general smooth policy parametrizations. More specifically, with the Fisher information matrix of the policy being positive definite: i) we show that a state-of-the-art variance-reduced PG method, which has only been shown to converge to stationary points, converges to the globally optimal value up to some inherent function approximation error due to policy parametrization; ii) we show that NPG enjoys a lower sample complexity; iii) we propose SRVR-NPG, which incorporates variance reduction into the NPG update. Our improvements follow from an observation that the convergence of (variance-reduced) PG and NPG methods can improve each other: the stationary convergence analysis of PG can be applied to NPG as well, and the global convergence analysis of NPG can help to establish the global convergence of (variance-reduced) PG methods.
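As a reference point for the updates being compared, here is a minimal sketch of a single natural policy gradient step for a softmax-linear policy, using a damped empirical Fisher matrix. The feature dimension, batch of random states/actions/returns, and damping constant are placeholders standing in for rollouts; this is vanilla NPG, not the paper's SRVR-NPG algorithm.

import numpy as np

rng = np.random.default_rng(1)
d, n_actions, batch = 8, 4, 256

def policy_probs(theta, phi):
    # softmax-linear policy: pi(a|s) proportional to exp(theta[a] @ phi(s))
    logits = phi @ theta.T
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def npg_step(theta, phi, actions, returns, lr=0.1, damping=1e-3):
    # score function grad log pi(a|s) for the softmax-linear parametrization
    p = policy_probs(theta, phi)
    onehot = np.eye(theta.shape[0])[actions]
    scores = np.einsum('bi,bj->bij', onehot - p, phi)   # (batch, n_actions, d)
    scores = scores.reshape(len(actions), -1)            # flatten parameters
    pg = (returns[:, None] * scores).mean(axis=0)        # REINFORCE-style gradient estimate
    F = scores.T @ scores / len(actions)                  # empirical Fisher information
    direction = np.linalg.solve(F + damping * np.eye(F.shape[0]), pg)
    return theta + lr * direction.reshape(theta.shape)

# placeholder batch: random state features, actions, and returns standing in for rollouts
phi = rng.normal(size=(batch, d))
theta = np.zeros((n_actions, d))
actions = rng.integers(n_actions, size=batch)
returns = rng.normal(size=batch)
theta = npg_step(theta, phi, actions, returns)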


Finding Bipartite Components in Hypergraphs Supplementary Material

Neural Information Processing Systems

In this section we prove Theorem 2. After giving some additional preliminaries and discussing the rules of the diffusion process, we construct a linear program which can compute the rate of change r satisfying the rules of the diffusion process. We then give a complete analysis of the new linear program, which establishes Theorem 2. A.1 Additional preliminaries: we are given a hypergraph H = (V, E). A.2 Counter-example showing that Rule (2) is needed: first, we recall the rules which the rate of change of the diffusion process must satisfy. One might have expected that Rules (0) and (1) together would define a unique process. In the counter-example, however, either {u, w} or {v, w} can participate in the diffusion and satisfy Rule (0), which makes the process not uniquely defined; we therefore introduce Rule (2) to ensure that there is a unique vector r satisfying the rules. A.3 Computing r by a linear program: we present an algorithm that computes the vector r = df/dt, studying every equivalence class U ∈ 𝒰 in turn and setting the r-value of the vertices in U recursively.


Quasi-Newton Methods for Saddle Point Problems Luo

Neural Information Processing Systems

The design and analysis of the proposed algorithm are based on estimating the square of the indefinite Hessian matrix, which differs from classical quasi-Newton methods for convex optimization.
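One way to read "estimating the square of the indefinite Hessian" is sketched below: given access only to Hessian-vector products with a symmetric indefinite H, maintain an SR1-style approximation of the positive semidefinite matrix H², where H²v is obtained by applying H twice. This is an illustrative toy with an explicit random matrix and random directions, not the paper's algorithm.

import numpy as np

rng = np.random.default_rng(2)
n = 6
Q = np.linalg.qr(rng.normal(size=(n, n)))[0]
H = Q @ np.diag(rng.uniform(-2, 2, n)) @ Q.T   # symmetric indefinite "Hessian"

def hvp(v):
    # Hessian-vector product oracle (the only access we assume)
    return H @ v

G = np.eye(n)                                   # current approximation of H^2 (which is PSD)
for _ in range(3 * n):
    u = rng.normal(size=n)                      # random direction (could also be chosen greedily)
    Au = hvp(hvp(u))                            # H^2 u via two Hessian-vector products
    resid = Au - G @ u
    denom = resid @ u
    if abs(denom) > 1e-12:                      # standard SR1 safeguard
        G = G + np.outer(resid, resid) / denom  # SR1 correction along u
print("relative approximation error:", np.linalg.norm(G - H @ H) / np.linalg.norm(H @ H))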


On the Universality of Graph Neural Networks on Large Random Graphs

Neural Information Processing Systems

We study the approximation power of Graph Neural Networks (GNNs) on latent position random graphs. In the large graph limit, GNNs are known to converge to certain "continuous" models known as c-GNNs, which directly enables a study of their approximation power on random graph models. In the absence of input node features, however, just as GNNs are limited by the Weisfeiler-Lehman isomorphism test, c-GNNs are severely limited on simple random graph models. For instance, they will fail to distinguish the communities of a well-separated Stochastic Block Model (SBM) with constant degree function. Thus, we consider recently proposed architectures that augment GNNs with unique node identifiers, referred to here as Structural GNNs (SGNNs). We study the convergence of SGNNs to their continuous counterpart (c-SGNNs) in the large random graph limit, under new conditions on the node identifiers. We then show that c-SGNNs are strictly more powerful than c-GNNs in the continuous limit, and prove their universality on several random graph models of interest, including most SBMs and a large class of random geometric graphs. Our results cover both permutation-invariant and permutation-equivariant architectures.
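To make "augmenting GNNs with unique node identifiers" concrete, here is a minimal one-layer message-passing computation on a random geometric graph in which each node's (otherwise uninformative) constant feature is concatenated with a random identifier. The graph radius, identifier distribution, and untrained weight matrix are illustrative assumptions, not the paper's exact construction.

import numpy as np

rng = np.random.default_rng(3)
n, d_id, d_hidden = 100, 16, 32

# latent position random graph: connect points that are close on the unit square
pos = rng.uniform(size=(n, 2))
dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
A = (dist < 0.2).astype(float)
np.fill_diagonal(A, 0.0)

# no informative input features: a constant feature concatenated with a unique random identifier
x_const = np.ones((n, 1))
node_ids = rng.normal(size=(n, d_id)) / np.sqrt(d_id)
X = np.concatenate([x_const, node_ids], axis=1)

# one mean-aggregation message-passing layer with a random (untrained) weight matrix
deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
W = rng.normal(size=(X.shape[1], d_hidden)) / np.sqrt(X.shape[1])
H = np.maximum((A @ X) / deg @ W, 0.0)   # ReLU of the mean over neighbors, times W
print("node embeddings:", H.shape)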


Greedy and Random Quasi-Newton Methods with Faster Explicit Superlinear Convergence Dachao Lin, Haishan Ye, Academy for Advanced Interdisciplinary Studies, Peking University

Neural Information Processing Systems

In this paper, we follow the work of Rodomanov and Nesterov [19] to study quasi-Newton methods. We focus on the common SR1 and BFGS quasi-Newton methods to establish better explicit (local) superlinear convergence rates. First, based on the greedy quasi-Newton update, which greedily selects the direction to maximize a certain measure of progress, we improve the convergence rate to a condition-number-free superlinear convergence rate. Second, based on the random quasi-Newton update, which selects the direction randomly from a spherically symmetric distribution, we establish the same superlinear convergence rate as above. Our analysis covers the approximation of a given Hessian matrix, unconstrained quadratic objectives, and general strongly convex, smooth, and strongly self-concordant functions.
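A minimal sketch of the greedy SR1 update in the Rodomanov-Nesterov line of work the abstract builds on: for a quadratic with Hessian A, pick the coordinate direction along which the current approximation G overestimates A the most, then apply an SR1 correction along it. The greedy rule shown (largest diagonal entry of G − A) is a simplification of the papers' scaled measure of progress, and the quadratic and dimensions are placeholders.

import numpy as np

rng = np.random.default_rng(4)
n = 8
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)            # Hessian of a strongly convex quadratic
L = np.linalg.eigvalsh(A).max()

G = L * np.eye(n)                  # standard initialization G0 = L*I, so G0 >= A
for k in range(2 * n):
    # greedy direction: basis vector where G currently overestimates A the most
    # (a simplification; the papers use a scaled measure of progress)
    i = np.argmax(np.diag(G - A))
    u = np.eye(n)[i]
    resid = (G - A) @ u
    denom = resid @ u
    if denom <= 1e-12:             # G already matches A along every basis direction
        break
    G = G - np.outer(resid, resid) / denom    # SR1 correction along u
    print(k, "tr(G - A) =", np.trace(G - A))  # measure of progress, decreases monotonically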


On the Exploration of Local Significant Differences For Two-Sample Test

Neural Information Processing Systems

Recent years have witnessed increasing attention on the two-sample test across diverse real applications; this work takes one more step by exploring local significant differences for the two-sample test.


Generalization Bound and Learning Methods for Data-Driven Projections in Linear Programming

Neural Information Processing Systems

How to solve high-dimensional linear programs (LPs) efficiently is a fundamental question. Recently, there has been a surge of interest in reducing LP sizes using random projections, which can accelerate solving LPs independently of improving LP solvers. This paper explores a new direction of data-driven projections, which use projection matrices learned from data instead of random projection matrices. Given training data of n-dimensional LPs, we learn an n × k projection matrix with n > k. When addressing a future LP instance, we reduce its dimensionality from n to k via the learned projection matrix, solve the resulting LP to obtain a k-dimensional solution, and apply the learned matrix to it to recover an n-dimensional solution. On the theoretical side, a natural question is: how much data is sufficient to ensure the quality of recovered solutions? We address this question based on the framework of data-driven algorithm design, which connects the amount of data sufficient for establishing generalization bounds to the pseudo-dimension of performance metrics.
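A minimal sketch of the reduce-solve-recover pipeline described above, using scipy's LP solver. The random packing-type instance and the random projection matrix are stand-ins (the projection would be learned from past instances in the paper; the learning step is not shown).

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, k, m = 200, 20, 60                      # original dim, reduced dim, number of constraints

# a bounded packing-type LP:  max c^T x  s.t.  A x <= b,  x >= 0
A = rng.uniform(0.0, 1.0, size=(m, n))
b = np.full(m, 10.0)
c = rng.uniform(0.0, 1.0, size=n)

# stand-in for the learned n-by-k projection matrix
P = rng.normal(size=(n, k)) / np.sqrt(k)

# reduce: substitute x = P y; keep both A(Py) <= b and Py >= 0 as constraints on y
A_red = np.vstack([A @ P, -P])
b_red = np.concatenate([b, np.zeros(n)])
red = linprog(-(c @ P), A_ub=A_red, b_ub=b_red, bounds=[(None, None)] * k, method="highs")
x_rec = P @ red.x                          # recover an n-dimensional feasible solution

full = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None)] * n, method="highs")
print("recovered objective:", c @ x_rec, " optimal objective:", -full.fun)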


A Bellman Equations for Markov Games

Neural Information Processing Systems

In this section, we present the Bellman equations for different types of values in Markov games. Recall the definition of CCE in our main paper (4); we restate it here after rescaling. First, a CCE always exists, since a Nash equilibrium for a general-sum game with payoff matrices (P, Q) is also a CCE defined by (P, Q), and a Nash equilibrium always exists. Third, a CCE in general-sum games need not be a Nash equilibrium. However, a CCE in zero-sum games is guaranteed to be a Nash equilibrium.
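For the zero-sum case mentioned last, the row player's minimax (Nash) strategy of a matrix game can be computed by a standard linear program, which also pins down the value any CCE of that game must attain. A small scipy sketch with rock-paper-scissors as an arbitrary example payoff matrix:

import numpy as np
from scipy.optimize import linprog

# payoff matrix P[i, j]: amount the column player pays the row player
P = np.array([[0.0, -1.0,  1.0],
              [1.0,  0.0, -1.0],
              [-1.0, 1.0,  0.0]])         # rock-paper-scissors
m, n = P.shape

# variables z = (x_1..x_m, v): row player's mixed strategy x and game value v
# maximize v  s.t.  (P^T x)_j >= v for every column j,  sum(x) = 1,  x >= 0
c = np.concatenate([np.zeros(m), [-1.0]])            # linprog minimizes, so minimize -v
A_ub = np.hstack([-P.T, np.ones((n, 1))])            # v - (P^T x)_j <= 0
b_ub = np.zeros(n)
A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]  # probabilities sum to one
b_eq = np.array([1.0])
bounds = [(0, None)] * m + [(None, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
x, v = res.x[:m], res.x[m]
print("minimax strategy:", np.round(x, 3), " game value:", round(v, 3))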


SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence

Neural Information Processing Systems

Stein Variational Gradient Descent (SVGD), a popular sampling algorithm, is often described as the kernelized gradient flow for the Kullback-Leibler divergence in the geometry of optimal transport. We introduce a new perspective that instead views SVGD as the (kernelized) gradient flow of the chi-squared divergence, which, we show, exhibits a strong form of uniform exponential ergodicity under conditions as weak as a Poincaré inequality. This perspective leads us to propose an alternative to SVGD, called Laplacian Adjusted Wasserstein Gradient Descent (LAWGD), that can be implemented from the spectral decomposition of the Laplacian operator associated with the target density. We show that LAWGD exhibits strong convergence guarantees and good practical performance.
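For reference, the SVGD update discussed above is simple to state: each particle moves along a kernel-smoothed gradient of the log target plus a repulsive kernel-gradient term. Below is a minimal numpy sketch with an RBF kernel (median-trick bandwidth) and a standard Gaussian target; the target, particle count, and step size are illustrative choices, and LAWGD itself (which needs the spectral decomposition of the target's Laplacian operator) is not shown.

import numpy as np

rng = np.random.default_rng(7)

def grad_log_p(x):
    # standard Gaussian target: log p(x) = -||x||^2 / 2 + const
    return -x

def svgd_step(x, eps=0.2):
    diff = x[:, None, :] - x[None, :, :]             # diff[j, i] = x_j - x_i
    sqd = np.sum(diff**2, axis=-1)
    h2 = np.median(sqd) / (2 * np.log(len(x)))       # median-trick bandwidth for the RBF kernel
    K = np.exp(-sqd / (2 * h2))
    # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    drive = K.T @ grad_log_p(x) / len(x)
    repulse = -np.einsum('ji,jid->id', K, diff) / (h2 * len(x))
    return x + eps * (drive + repulse)

x = rng.normal(size=(100, 2)) + 1.5                  # particles start offset from the target
for _ in range(500):
    x = svgd_step(x)
print("particle mean:", x.mean(axis=0), "particle variance:", x.var(axis=0))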