Gradient Descent
A Stochastic First-Order Method for Ordered Empirical Risk Minimization
We propose a new stochastic first-order method for empirical risk minimization problems such as those that arise in machine learning. The traditional approaches, such as (mini-batch) stochastic gradient descent (SGD), utilize an unbiased gradient estimator of the empirical average loss. In contrast, we develop a computationally efficient method to construct a gradient estimator that is purposely biased toward those observations with higher current losses, and that itself is an unbiased gradient estimator of an ordered modification of the empirical average loss. On the theory side, we show that the proposed algorithm is guaranteed to converge at a sublinear rate to a global optimum for convex loss and to a critical point for non-convex loss. Furthermore, we prove a new generalization bound for the proposed algorithm. On the empirical side, we present extensive numerical experiments, in which our proposed method consistently improves the test errors compared with the standard mini-batch SGD in various models including SVM, logistic regression, and (non-convex) deep learning problems.
Unified Optimal Analysis of the (Stochastic) Gradient Method
In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on $\mu$-strongly convex functions under a (milder than standard) $L$-smoothness assumption. We show that SGD converges after $T$ iterations as $O\left( L \|x_0-x^\star\|^2 \exp \bigl[-\frac{\mu}{4L}T \bigr] + \frac{\sigma^2}{\mu T} \right)$ where $\sigma^2$ measures the variance. For deterministic gradient descent (GD) and SGD in the interpolation setting we have $\sigma^2 =0$ and we recover the exponential convergence rate. The bound matches with the best known iteration complexity of GD and SGD, up to constants.
Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
Zhang, Guodong, Li, Lala, Nado, Zachary, Martens, James, Sachdeva, Sushant, Dahl, George E., Shallue, Christopher J., Grosse, Roger
Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments, and analysis of a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization.
SVGD: A Virtual Gradients Descent Method for Stochastic Optimization
Inspired by dynamic programming, we propose Stochastic Virtual Gradient Descent (SVGD) algorithm where the Virtual Gradient is defined by computational graph and automatic differentiation. The method is computationally efficient and has little memory requirements. We also analyze the theoretical convergence properties and implementation of the algorithm. Experimental results on multiple datasets and network models show that SVGD has advantages over other stochastic optimization methods.
A Hybrid Stochastic Optimization Framework for Stochastic Composite Nonconvex Optimization
Tran-Dinh, Quoc, Pham, Nhan H., Phan, Dzung T., Nguyen, Lam M.
In this paper, we introduce a new approach to develop stochastic optimization algorithms for solving stochastic composite and possibly nonconvex optimization problems. The main idea is to combine two stochastic estimators to form a new hybrid one. We first introduce our hybrid estimator and then investigate its fundamental properties to form a foundation theory for algorithmic development. Next, we apply our theory to develop several variants of stochastic gradient methods to solve both expectation and finite-sum composite optimization problems. Our first algorithm can be viewed as a variant of proximal stochastic gradient methods with a single-loop, but can achieve $\mathcal{O}(\sigma^3\varepsilon^{-1} + \sigma\varepsilon^{-3})$ complexity bound that is significantly better than the $\mathcal{O}(\sigma^2\varepsilon^{-4})$-complexity in state-of-the-art stochastic gradient methods, where $\sigma$ is the variance and $\varepsilon$ is a desired accuracy. Then, we consider two different variants of our method: adaptive step-size and double-loop schemes that have the same theoretical guarantees as in our first algorithm. We also study two mini-batch variants and develop two hybrid SARAH-SVRG algorithms to solve the finite-sum problems. In all cases, we achieve the best-known complexity bounds under standard assumptions. We test our methods on several numerical examples with real datasets and compare them with state-of-the-arts. Our numerical experiments show that the new methods are comparable and, in many cases, outperform their competitors.
Circuit-Based Intrinsic Methods to Detect Overfitting
Chatterjee, Sat, Mishchenko, Alan
The focus of this paper is on intrinsic methods to detect overfitting. These rely only on the model and the training data, as opposed to traditional extrinsic methods that rely on performance on a test set or on bounds from model complexity. We propose a family of intrinsic methods called Counterfactual Simulation (CFS) which analyze the flow of training examples through the model by identifying and perturbing rare patterns. By applying CFS to logic circuits we get a method that has no hyper-parameters and works uniformly across different types of models such as neural networks, random forests and lookup tables. Experimentally, CFS can separate models with different levels of overfit using only their logic circuit representations without any access to the high level structure. By comparing lookup tables, neural networks, and random forests using CFS, we get insight into why neural networks generalize. In particular, we find that stochastic gradient descent in neural nets does not lead to "brute force" memorization, but finds common patterns (whether we train with actual or randomized labels), and neural networks are not unlike forests in this regard. Finally, we identify a limitation with our proposal that makes it unsuitable in an adversarial setting, but points the way to future work on robust intrinsic methods.
On the Convergence of FedAvg on Non-IID Data
Li, Xiang, Huang, Kaixuan, Yang, Wenhao, Wang, Shusen, Zhang, Zhihua
Federated learning enables a large amount of edge computing devices to learn a centralized model while keeping all local data on edge devices. As a leading algorithm in this setting, Federated Averaging (\texttt{FedAvg}) runs Stochastic Gradient Descent (SGD) in parallel on a small subset of the total devices and averages the sequences only once in a while. Despite its simplicity, it lacks theoretical guarantees in the federated setting. In this paper, we analyze the convergence of \texttt{FedAvg} on non-iid data. We investigate the effect of different sampling and averaging schemes, which are crucial especially when data are unbalanced. We prove a concise convergence rate of $\mathcal{O}(\frac{1}{T})$ for \texttt{FedAvg} with proper sampling and averaging schemes in convex problems, where $T$ is the total number of steps. Our results show that heterogeneity of data slows down the convergence, which is intrinsic in the federated setting. Low device participation rate can be achieved without severely harming the optimization process in federated learning. We show that there is a trade-off between communication efficiency and convergence rate. We analyze the necessity of learning rate decay by taking a linear regression as an example. Our work serves as a guideline for algorithm design in applications of federated learning, where heterogeneity and unbalance of data are the common case.
Quickly Finding the Best Linear Model in High Dimensions
We study the problem of finding the best linear model that can minimize least-squares loss given a data-set. While this problem is trivial in the low dimensional regime, it becomes more interesting in high dimensions where the population minimizer is assumed to lie on a manifold such as sparse vectors. We propose projected gradient descent (PGD) algorithm to estimate the population minimizer in the finite sample regime. We establish linear convergence rate and data dependent estimation error bounds for PGD. Our contributions include: 1) The results are established for heavier tailed sub-exponential distributions besides sub-gaussian. 2) We directly analyze the empirical risk minimization and do not require a realizable model that connects input data and labels. 3) Our PGD algorithm is augmented to learn the bias terms which boosts the performance. The numerical experiments validate our theoretical results.
Music artist Recommender System using Stochastic Gradient Descent Machine Learning from Scratch…
Recommender Systems are becoming more and more relevant as the amount of information on "The Internets" is exponentially increasing: Finding what you might enjoy from books, movies, games to who to follow on Instagram becomes increasingly difficult. Moreover, we (the users) require faster interactions with the products we use on a daily basis, so we don't feel like we waste time (even though we do it more than ever before in human history). As the amounts of data increase, your computational resources might have a hard time to produce fast enough results. Here, we'll have a look at a succinct implementation of a Recommender System that is both the basis of many real-world implementations and is easy to understand. Traditionally, recommender systems are built around user ratings given for a set of items, e.g.
Searching for Interaction Functions in Collaborative Filtering
Yao, Quanming, Chen, Xiangning, Kwok, James, Li, Yong
Interaction function (IFC), which captures interactions among items and users, is of great importance in collaborative filtering (CF). The inner product is the most popular IFC due to its success in low-rank matrix factorization. However, interactions in real-world applications can be highly complex. Many other operations (such as plus and concatenation) have also been proposed, and can possibly offer better performance than the inner product. In this paper, motivated by the success of automated machine learning, we propose to search for proper interaction functions (SIF) for CF tasks. We first design an expressive search space for SIF by reviewing and generalizing existing CF approaches. We then propose to represent the search space as a structured multi-layer perceptron, and design a stochastic gradient descent algorithm which can simultaneously update both architectures and learning parameters. Experimental results demonstrate that the proposed method can be much more efficient than popular AutoML approaches, and also obtain much better prediction performance than state-of-the-art CF approaches.