Mathematical & Statistical Methods
ODE Analysis of Stochastic Gradient Methods with Optimism and Anchoring for Minimax Problems and GANs
Ryu, Ernest K., Yuan, Kun, Yin, Wotao
Despite remarkable empirical success, the training dynamics of generative adversarial networks (GAN), which involves solving a minimax game using stochastic gradients, is still poorly understood. In this work, we analyze last-iterate convergence of simultaneous gradient descent (simGD) and its variants under the assumption of convex-concavity, guided by a continuous-time analysis with differential equations. First, we show that simGD, as is, converges with stochastic sub-gradients under strict convexity in the primal variable. Second, we generalize optimistic simGD to accommodate an optimism rate separate from the learning rate and show its convergence with full gradients. Finally, we present anchored simGD, a new method, and show convergence with stochastic subgradients.
Sparse Representation Classification via Screening for Graphs
Shen, Cencheng, Chen, Li, Dong, Yuexiao, Priebe, Carey
The sparse representation classifier (SRC) is shown to work well for image recognition problems that satisfy a subspace assumption. In this paper we propose a new implementation of SRC via screening, establish its equivalence to the original SRC under regularity conditions, and prove its classification consistency for random graphs drawn from stochastic blockmodels. The results are demonstrated via simulations and real data experiments, where the new algorithm achieves comparable numerical performance but significantly faster.
On stochastic gradient Langevin dynamics with dependent data streams: the fully non-convex case
Chau, Ngoc Huy, Moulines, รric, Rรกsonyi, Miklos, Sabanis, Sotirios, Zhang, Ying
We consider the problem of sampling from a target distribution which is \emph{not necessarily logconcave}. Non-asymptotic analysis results are established in a suitable Wasserstein-type distance of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, when the gradient is driven by even \emph{dependent} data streams. Our estimates are sharper and \emph{uniform} in the number of iterations, in contrast to those in previous studies.
Differentiable Sorting using Optimal Transport:The Sinkhorn CDF and Quantile Operator
Cuturi, Marco, Teboul, Olivier, Vert, Jean-Philippe
Sorting an array is a fundamental routine in machine learning, one that is used to compute rank-based statistics, cumulative distribution functions (CDFs), quantiles, or to select closest neighbors and labels. The sorting function is however piece-wise constant (the sorting permutation of a vector does not change if the entries of that vector are infinitesimally perturbed) and therefore has no gradient information to back-propagate. We propose a framework to sort elements that is algorithmically differentiable. We leverage the fact that sorting can be seen as a particular instance of the optimal transport (OT) problem on $\mathbb{R}$, from input values to a predefined array of sorted values (e.g. $1,2,\dots,n$ if the input array has $n$ elements). Building upon this link , we propose generalized CDFs and quantile operators by varying the size and weights of the target presorted array. Because this amounts to using the so-called Kantorovich formulation of OT, we call these quantities K-sorts, K-CDFs and K-quantiles. We recover differentiable algorithms by adding to the OT problem an entropic regularization, and approximate it using a few Sinkhorn iterations. We call these operators S-sorts, S-CDFs and S-quantiles, and use them in various learning settings: we benchmark them against the recently proposed neuralsort [Grover et al. 2019], propose applications to quantile regression and introduce differentiable formulations of the top-k accuracy that deliver state-of-the art performance.
Spectral Clustering of Signed Graphs via Matrix Power Means
Mercado, Pedro, Tudisco, Francesco, Hein, Matthias
Signed graphs encode positive (attractive) and negative (repulsive) relations between nodes. We extend spectral clustering to signed graphs via the one-parameter family of Signed Power Mean Laplacians, defined as the matrix power mean of normalized standard and signless Laplacians of positive and negative edges. We provide a thorough analysis of the proposed approach in the setting of a general Stochastic Block Model that includes models such as the Labeled Stochastic Block Model and the Censored Block Model. We show that in expectation the signed power mean Laplacian captures the ground truth clusters under reasonable settings where state-of-the-art approaches fail. Moreover, we prove that the eigenvalues and eigenvector of the signed power mean Laplacian concentrate around their expectation under reasonable conditions in the general Stochastic Block Model. Extensive experiments on random graphs and real world datasets confirm the theoretically predicted behaviour of the signed power mean Laplacian and show that it compares favourably with state-of-the-art methods.
Solving Empirical Risk Minimization in the Current Matrix Multiplication Time
Lee, Yin Tat, Song, Zhao, Zhang, Qiuyi
Many convex problems in machine learning and computer science share the same form: \begin{align*} \min_{x} \sum_{i} f_i( A_i x + b_i), \end{align*} where $f_i$ are convex functions on $\mathbb{R}^{n_i}$ with constant $n_i$, $A_i \in \mathbb{R}^{n_i \times d}$, $b_i \in \mathbb{R}^{n_i}$ and $\sum_i n_i = n$. This problem generalizes linear programming and includes many problems in empirical risk minimization. In this paper, we give an algorithm that runs in time \begin{align*} O^* ( ( n^{\omega} + n^{2.5 - \alpha/2} + n^{2+ 1/6} ) \log (n / \delta) ) \end{align*} where $\omega$ is the exponent of matrix multiplication, $\alpha$ is the dual exponent of matrix multiplication, and $\delta$ is the relative accuracy. Note that the runtime has only a log dependence on the condition numbers or other data dependent parameters and these are captured in $\delta$. For the current bound $\omega \sim 2.38$ [Vassilevska Williams'12, Le Gall'14] and $\alpha \sim 0.31$ [Le Gall, Urrutia'18], our runtime $O^* ( n^{\omega} \log (n / \delta))$ matches the current best for solving a dense least squares regression problem, a special case of the problem we consider. Very recently, [Alman'18] proved that all the current known techniques can not give a better $\omega$ below $2.168$ which is larger than our $2+1/6$. Our result generalizes the very recent result of solving linear programs in the current matrix multiplication time [Cohen, Lee, Song'19] to a more broad class of problems. Our algorithm proposes two concepts which are different from [Cohen, Lee, Song'19] : $\bullet$ We give a robust deterministic central path method, whereas the previous one is a stochastic central path which updates weights by a random sparse vector. $\bullet$ We propose an efficient data-structure to maintain the central path of interior point methods even when the weights update vector is dense.
Learning Erd\H{o}s-R\'enyi Random Graphs via Edge Detecting Queries
Li, Zihan, Fresacher, Matthias, Scarlett, Jonathan
In this paper, we consider the problem of learning an unknown graph via queries on groups of nodes, with the result indicating whether or not at least one edge is present among those nodes. We establish such bounds for a variety of algorithms inspired by the group testing problem, with explicit constant factors indicating a near-optimal number of tests, and in some cases asymptotic optimality including constant factors. I. INTRODUCTION Graphs are a ubiquitous tool in modern statistics and machine learning for depicting interactions, relations, and physical connections in networks, such as social networks, biological networks, sensor networks, and so on. Often, the graph is not known a priori, and must be learned via queries to the network. In this paper, we consider the problem of graph learning via edge detecting queries, where each query contains a subset of the nodes, and the binary outcome indicates whether or not there is at least one edge among these nodes. See Section IA for previous work on this problem. An application of this problem highlighted in previous works such as [15] is that of learning which chemicals react with each other, using tests that are able to detect whether any reaction occurs. Another potential application is learning connectivity in large wireless networks: Each node is given a unique identifier, and in response to a query, each node sends feedback to a central unit if both itself and one or more of its neigbors are included in that query.
Hitting Time of Stochastic Gradient Langevin Dynamics to Stationary Points: A Direct Analysis
Chen, Xi, Du, Simon S., Tong, Xin T.
Stochastic gradient Langevin dynamics (SGLD) is a fundamental algorithm in stochastic optimization. Recent work by Zhang et al. [2017] presents an analysis for the hitting time of SGLD for the first and second order stationary points. The proof in Zhang et al. [2017] is a two-stage procedure through bounding the Cheeger's constant, which is rather complicated and leads to loose bounds. In this paper, using intuitions from stochastic differential equations, we provide a direct analysis for the hitting times of SGLD to the first and second order stationary points. Our analysis is straightforward. It only relies on basic linear algebra and probability theory tools. Our direct analysis also leads to tighter bounds comparing to Zhang et al. [2017] and shows the explicit dependence of the hitting time on different factors, including dimensionality, smoothness, noise strength, and step size effects. Under suitable conditions, we show that the hitting time of SGLD to first-order stationary points can be dimension-independent. Moreover, we apply our analysis to study several important online estimation problems in machine learning, including linear regression, matrix factorization, and online PCA.
Convergence rates for ordinal embedding
Ellenberg, Jordan S., Jain, Lalit
We prove optimal bounds for the convergence rate of ordinal embedding (also known as non-metric multidimensional scaling) in the 1-dimensional case. The examples witnessing optimality of our bounds arise from a result in additive number theory on sets of integers with no three-term arithmetic progressions. We also carry out some computational experiments aimed at developing a sense of what the convergence rate for ordinal embedding might look like in higher dimensions.