Mathematical & Statistical Methods
Efficient Stochastic Gradient Hard Thresholding
Zhou, Pan, Yuan, Xiaotong, Feng, Jiashi
Stochastic gradient hard thresholding methods have recently been shown to work favorably in solving large-scale empirical risk minimization problems under sparsity or rank constraint. Despite the improved iteration complexity over full gradient methods, the gradient evaluation and hard thresholding complexity of the existing stochastic algorithms usually scales linearly with data size, which could still be expensive when data is huge and the hard thresholding step could be as expensive as singular value decomposition in rank-constrained problems. To address these deficiencies, we propose an efficient hybrid stochastic gradient hard thresholding (HSG-HT) method that can be provably shown to have sample-size-independent gradient evaluation and hard thresholding complexity bounds. Specifically, we prove that the stochastic gradient evaluation complexity of HSG-HT scales linearly with inverse of sub-optimality and its hard thresholding complexity scales logarithmically. By applying the heavy ball acceleration technique, we further propose an accelerated variant of HSG-HT which can be shown to have improved factor dependence on restricted condition number.
Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters
Allen-Zhu, Zeyuan, Yuan, Yang, Sridharan, Karthik
The amount of data available in the world is growing faster than our ability to deal with it. However, if we take advantage of the internal structure, data may become much smaller for machine learning purposes. In this paper we focus on one of the fundamental machine learning tasks, empirical risk minimization (ERM), and provide faster algorithms with the help from the clustering structure of the data. We introduce a simple notion of raw clustering that can be efficiently computed from the data, and propose two algorithms based on clustering information. Our accelerated algorithm ClusterACDM is built on a novel Haar transformation applied to the dual space of the ERM problem, and our variance-reduction based algorithm ClusterSVRG introduces a new gradient estimator using clustering.
Learning from Rational Behavior: Predicting Solutions to Unknown Linear Programs
Jabbari, Shahin, Rogers, Ryan M., Roth, Aaron, Wu, Steven Z.
We define and study the problem of predicting the solution to a linear program (LP) given only partial information about its objective and constraints. This generalizes the problem of learning to predict the purchasing behavior of a rational agent who has an unknown objective function, that has been studied under the name "Learning from Revealed Preferences". We give mistake bound learning algorithms in two settings: in the first, the objective of the LP is known to the learner but there is an arbitrary, fixed set of constraints which are unknown. Each example is defined by an additional known constraint and the goal of the learner is to predict the optimal solution of the LP given the union of the known and unknown constraints. This models the problem of predicting the behavior of a rational agent whose goals are known, but whose resources are unknown.
Variance Reduction in Stochastic Gradient Langevin Dynamics
Dubey, Kumar Avinava, Reddi, Sashank J., Williamson, Sinead A., Poczos, Barnabas, Smola, Alexander J., Xing, Eric P.
Stochastic gradient-based Monte Carlo methods such as stochastic gradient Langevin dynamics are useful tools for posterior inference on large scale datasets in many machine learning applications. These methods scale to large datasets by using noisy gradients calculated using a mini-batch or subset of the dataset. However, the high variance inherent in these noisy gradients degrades performance and leads to slower mixing. In this paper, we present techniques for reducing variance in stochastic gradient Langevin dynamics, yielding novel stochastic Monte Carlo methods that improve performance by reducing the variance in the stochastic gradient. We show that our proposed method has better theoretical guarantees on convergence rate than stochastic Langevin dynamics.
Online and Stochastic Gradient Methods for Non-decomposable Loss Functions
Kar, Purushottam, Narasimhan, Harikrishna, Jain, Prateek
Modern applications in sensitive domains such as biometrics and medicine frequently require the use of non-decomposable loss functions such as precision@k, F-measure etc. Compared to point loss functions such as hinge-loss, these offer much more fine grained control over prediction, but at the same time present novel challenges in terms of algorithm design and analysis. In this work we initiate a study of online learning techniques for such non-decomposable loss functions with an aim to enable incremental learning as well as design scalable solvers for batch problems. To this end, we propose an online learning framework for such loss functions. Our model enjoys several nice properties, chief amongst them being the existence of efficient online learning algorithms with sublinear regret and online to batch conversion bounds.
The Power of Graph Convolutional Networks to Distinguish Random Graph Models: Short Version
Magner, Abram, Baranwal, Mayank, Hero, Alfred O. III
Graph convolutional networks (GCNs) are a widely used method for graph representation learning. We investigate the power of GCNs, as a function of their number of layers, to distinguish between different random graph models on the basis of the embeddings of their sample graphs. In particular, the graph models that we consider arise from graphons, which are the most general possible parameterizations of infinite exchangeable graph models and which are the central objects of study in the theory of dense graph limits. We exhibit an infinite class of graphons that are well-separated in terms of cut distance and are indistinguishable by a GCN with nonlinear activation functions coming from a certain broad class if its depth is at least logarithmic in the size of the sample graph. These results theoretically match empirical observations of several prior works. Finally, we show a converse result that for pairs of graphons satisfying a degree profile separation property, a very simple GCN architecture suffices for distinguishability. To prove our results, we exploit a connection to random walks on graphs.
An Optimal Multistage Stochastic Gradient Method for Minimax Problems
Fallah, Alireza, Ozdaglar, Asuman, Pattathil, Sarath
In this paper, we study the minimax optimization problem in the smooth and strongly convex-strongly concave setting when we have access to noisy estimates of gradients. In particular, we first analyze the stochastic Gradient Descent Ascent (GDA) method with constant stepsize, and show that it converges to a neighborhood of the solution of the minimax problem. We further provide tight bounds on the convergence rate and the size of this neighborhood. Next, we propose a multistage variant of stochastic GDA (M-GDA) that runs in multiple stages with a particular learning rate decay schedule and converges to the exact solution of the minimax problem. We show M-GDA achieves the lower bounds in terms of noise dependence without any assumptions on the knowledge of noise characteristics. We also show that M-GDA obtains a linear decay rate with respect to the error's dependence on the initial error, although the dependence on condition number is suboptimal. In order to improve this dependence, we apply the multistage machinery to the stochastic Optimistic Gradient Descent Ascent (OGDA) algorithm and propose the M-OGDA algorithm which also achieves the optimal linear decay rate with respect to the initial error. To the best of our knowledge, this method is the first to simultaneously achieve the best dependence on noise characteristic as well as the initial error and condition number.
Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization
Horvรกth, Samuel, Lei, Lihua, Richtรกrik, Peter, Jordan, Michael I.
Adaptivity is an important yet under-studied property in modern optimization theory. The gap between the state-of-the-art theory and the current practice is striking in that algorithms with desirable theoretical guarantees typically involve drastically different settings of hyperparameters, such as step-size schemes and batch sizes, in different regimes. Despite the appealing theoretical results, such divisive strategies provide little, if any, insight to practitioners to select algorithms that work broadly without tweaking the hyperparameters. In this work, blending the "geometrization" technique introduced by Lei & Jordan 2016 and the \texttt{SARAH} algorithm of Nguyen et al., 2017, we propose the Geometrized \texttt{SARAH} algorithm for non-convex finite-sum and stochastic optimization. Our algorithm is proved to achieve adaptivity to both the magnitude of the target accuracy and the Polyak-\L{}ojasiewicz (PL) constant if present. In addition, it achieves the best-available convergence rate for non-PL objectives simultaneously while outperforming existing algorithms for PL objectives.
Generalized Kernel-Based Dynamic Mode Decomposition
Heas, Patrick, Herzet, Cedric, Combes, Benoit
Reduced modeling in high-dimensional reproducing kernel Hilbert spaces offers the opportunity to approximate efficiently non-linear dynamics. In this work, we devise an algorithm based on low rank constraint optimization and kernel-based computation that generalizes a recent approach called "kernel-based dynamic mode decomposition". This new algorithm is characterized by a gain in approximation accuracy, as evidenced by numerical simulations, and in computational complexity.
On the Communication Latency of Wireless Decentralized Learning
We consider a wireless network comprising $n$ nodes located within a circular area of radius $R$, which are participating in a decentralized learning algorithm to optimize a global objective function using their local datasets. To enable gradient exchanges across the network, we assume each node communicates only with a set of neighboring nodes, which are within a distance $R n^{-\beta}$ of itself, where $\beta\in(0,\frac{1}{2})$. We use tools from network information theory and random geometric graph theory to show that the communication delay for a single round of exchanging gradients on all the links throughout the network scales as $\mathcal{O}\left(\frac{n^{2-3\beta}}{\beta\log n}\right)$, increasing (at different rates) with both the number of nodes and the gradient exchange threshold distance.