Gradient Descent
Convergence Behaviour of Some Gradient-Based Methods on Bilinear Games
Min-max optimization has attracted much attention in the machine learning community due to the popularization of deep generative models and adversarial training. The optimization is quite different from traditional minimization analysis. For example, gradient descent does not converge in one of the simplest settings -- bilinear games. In this paper, we try to understand several gradient-based algorithms for bilinear min-max games: gradient descent, extra-gradient, optimistic gradient descent and the momentum method, for both simultaneous and alternating updates. We provide necessary and sufficient conditions for their convergence, with the Schur theorem. Furthermore, by extending these algorithms to more general parameter settings, we are able to optimize over larger parameter spaces to find the optimal convergence rates. Our results imply that alternating updates converge more easily in min-max games than simultaneous updates.
Adaptive Ensemble of Classifiers with Regularization for Imbalanced Data Classification
Wang, Chen, Yu, Qin, Luo, Ruisen, Hui, Dafeng, Zhou, Kai, Yu, Yanmei, Sun, Chao, Gong, Xiaofeng
Dynamic ensembling of classifiers is an effective approach in processing label-imbalanced classifications. However, in dynamic ensemble methods, the combination of classifiers is usually determined by the local competence and conventional regularization methods are difficult to apply, leaving the technique prone to overfitting. In this paper, focusing on the binary label-imbalanced classification field, a novel method of Adaptive Ensemble of classifiers with Regularization (AER) has been proposed. The method deals with the overfitting problem from a perspective of implicit regularization. Specifically, it leverages the properties of Stochastic Gradient Descent (SGD) to obtain the solution with the minimum norm to achieve regularization, and interpolates ensemble weights via the global geometry of data to further prevent overfitting. The method enjoys a favorable time and memory complexity, and theoretical proofs show that algorithms implemented with AER paradigm have time and memory complexities upper-bounded by their original implementations. Furthermore, the proposed AER method is tested with a specific implementation based on Gradient Boosting Machine (XGBoost) on the three datasets: UCI Bioassay, KEEL Abalone19, and a set of GMM-sampled artificial dataset. Results show that the proposed AER algorithm can outperform the major existing algorithms based on multiple metrics, and Mcnemar's tests are applied to validate performance superiorities. To summarize, this work complements regularization for dynamic ensemble methods and develops an algorithm superior in grasping both the global and local geometry of data to alleviate overfitting in imbalanced data classification.
Zen of Stochastic Gradient Descent Principle
We are living in an age where almost any information is just a swipe away. There are innumerable blogs, videos, articles, papers and even podcasts on any topic that you want to consume information on. There are so many school thoughts in each and a lot of information on each in the internet. You are kinda getting the gist of where I am getting at. There is just so much information out there that it causes analysis paralysis and we become just consumers of information with nothing to act on.
Cross-Domain Collaborative Filtering via Translation-based Learning
With the proliferation of social media platforms and e-commerce sites, several cross-domain collaborative filtering strategies have been recently introduced to transfer the knowledge of user preferences across domains. The main challenge of cross-domain recommendation is to weigh and learn users' different behaviors in multiple domains. In this paper, we propose a Cross-Domain collaborative filtering model following a Translation-based strategy, namely CDT. In our model, we learn the embedding space with translation vectors and capture high-order feature interactions in users' multiple preferences across domains. In doing so, we efficiently compute the transitivity between feature latent embeddings, that is if feature pairs have high interaction weights in the latent space, then feature embeddings with no observed interactions across the domains will be closely related as well. We formulate our objective function as a ranking problem in factorization machines and learn the model's parameters via gradient descent. In addition, to better capture the non-linearity in user preferences across domains we extend the proposed CDT model by using a deep learning strategy, namely DeepCDT. Our experiments on six publicly available cross-domain tasks demonstrate the effectiveness of the proposed models, outperforming other state-of-the-art cross-domain strategies.
Model inference for Ordinary Differential Equations by parametric polynomial kernel regression
Green, David K. E., Rindler, Filip
Model inference for dynamical systems aims to estimate the future behaviour of a system from observations. Purely model-free statistical methods, such as Artificial Neural Networks, tend to perform poorly for such tasks. They are therefore not well suited to many questions from applications, for example in Bayesian filtering and reliability estimation. This work introduces a parametric polynomial kernel method that can be used for inferring the future behaviour of Ordinary Differential Equation models, including chaotic dynamical systems, from observations. Using numerical integration techniques, parametric representations of Ordinary Differential Equations can be learnt using Backpropagation and Stochastic Gradient Descent. The polynomial technique presented here is based on a nonparametric method, kernel ridge regression. However, the time complexity of nonparametric kernel ridge regression scales cubically with the number of training data points. Our parametric polynomial method avoids this manifestation of the curse of dimensionality, which becomes particularly relevant when working with large time series data sets. Two numerical demonstrations are presented. First, a simple regression test case is used to illustrate the method and to compare the performance with standard Artificial Neural Network techniques. Second, a more substantial test case is the inference of a chaotic spatio-temporal dynamical system, the Lorenz--Emanuel system, from observations. Our method was able to successfully track the future behaviour of the system over time periods much larger than the training data sampling rate. Finally, some limitations of the method are presented, as well as proposed directions for future work to mitigate these limitations.
Gradient Descent Finds Global Minima for Generalizable Deep Neural Networks of Practical Sizes
Kawaguchi, Kenji, Huang, Jiaoyang
In this paper, we theoretically prove that gradient descent can find a global minimum for nonlinear deep neural networks of sizes commonly encountered in practice. The theory developed in this paper requires only the number of trainable parameters to increase linearly as the number of training samples increases. This allows the size of the deep neural networks to be several orders of magnitude smaller than that required by the previous theories. Moreover, we prove that the linear increase of the size of the network is the optimal rate and that it cannot be improved, except by a logarithmic factor. Furthermore, deep neural networks with the trainability guarantee are shown to generalize well to unseen test samples with a natural dataset but not a random dataset.
Extending the step-size restriction for gradient descent to avoid strict saddle points
Schaeffer, Hayden, McCalla, Scott G.
We provide larger step-size restrictions for which gradient descent based algorithms (almost surely) avoid strict saddle points. In particular, consider a twice differentiable (non-convex) objective function whose gradient has Lipschitz constant L and whose Hessian is well-behaved. We prove that the probability of initial conditions for gradient descent with step-size up to 2/L converging to a strict saddle point, given one uniformly random initialization, is zero. This extends previous results up to the sharp limit imposed by the convex case. In addition, the arguments hold in the case when a learning rate schedule is given, with either a continuous decaying rate or a piece-wise constant schedule.
Path Length Bounds for Gradient Descent and Flow
Gupta, Chirag, Balakrishnan, Sivaraman, Ramdas, Aaditya
We provide path length bounds on gradient descent (GD) and flow (GF) curves for various classes of smooth convex and nonconvex functions. We make six distinct contributions: (a) we prove a meta-theorem that if GD has linear convergence towards an optimal set, then its path length is upper bounded by the distance to the optimal set multiplied by a function of the rate of convergence, (b) under the Polyak-Lojasiewicz (PL) condition (a generalization of strong convexity that allows for certain nonconvex functions), we show that the aforementioned multiplicative factor is at most $\sqrt{\kappa}$, (c) we show an $\widetilde\Omega(\sqrt{d} \wedge \kappa^{1/4})$, times the length of the direct path, lower bound on the worst-case path length for PL functions, (d) for the special case of quadratics, we show that the bound is $\Theta(\min\{\sqrt{d},\sqrt{\log \kappa}\})$ and in some cases can be independent of $\kappa$, (e) under the weaker assumption of just convexity, where there is no natural notion of a condition number, we prove that the path length can be at most $2^{10d^2}$ times the length of the direct path, (f) finally, for separable quasiconvex functions the path length is both upper and lower bounded by ${\Theta}(\sqrt{d})$ times the length of the direct path.
Calibrating the Learning Rate for Adaptive Gradient Methods to Improve Generalization Performance
Tong, Qianqian, Liang, Guannan, Bi, Jinbo
Although adaptive gradient methods (AGMs) have fast speed in training deep neural networks, it is known to generalize worse than the stochastic gradient descent (SGD) or SGD with momentum (S-Momentum). Many works have attempted to modify AGMs so to close the gap in generalization performance between AGMs and S-Momentum, but they do not answer why there is such a gap. We identify that the anisotropic scale of the adaptive learning rate (A-LR) used by AGMs contributes to the generalization performance gap, and all existing modified AGMs actually represent efforts in revising the A-LR. Because the A-LR varies significantly across the dimensions of the problem over the optimization epochs (i.e., anisotropic scale), we propose a new AGM by calibrating the A-LR with a {\em softplus} function, resulting in the \textsc{Sadam} and \textsc{SAMSGrad} methods\footnote{Code is available at https://github.com/neilliang90/Sadam.git.}. These methods have better chance to not trap at sharp local minimizers, which helps them resume the dips in the generalization error curve observed with SGD and S-Momentum. We further provide a new way to analyze the convergence of AGMs (e.g., \textsc{Adam}, \textsc{Sadam}, and \textsc{SAMSGrad}) under the nonconvex, non-strongly convex, and Polyak-{\L}ojasiewicz conditions. We prove that the convergence rate of ADAM also depends on its hyper-parameter epsilon, which has been overlooked in prior convergence analysis. Empirical studies support our observation of the anisotropic A-LR and show that the proposed methods outperform existing AGMs and generalize even better than S-Momentum in multiple deep learning tasks.
How Good is SGD with Random Shuffling?
We study the performance of stochastic gradient descent (SGD) on smooth and strongly-convex finite-sum optimization problems. In contrast to the majority of existing theoretical works, which assume that individual functions are sampled with replacement, we focus here on popular but poorly-understood heuristics, which involve going over random permutations of the individual functions. This setting has been investigated in several recent works, but the optimal error rates remains unclear. In this paper, we provide lower bounds on the expected optimization error with these heuristics (using SGD with any constant step size), which elucidate their advantages and disadvantages. In particular, we prove that after $k$ passes over $n$ individual functions, if the functions are re-shuffled after every pass, the best possible optimization error for SGD is at least $\Omega\left(1/(nk)^2+1/nk^3\right)$, which partially corresponds to recently derived upper bounds, and we conjecture to be tight. Moreover, if the functions are only shuffled once, then the lower bound increases to $\Omega(1/nk^2)$. Since there are strictly smaller upper bounds for random reshuffling, this proves an inherent performance gap between SGD with single shuffling and repeated shuffling. As a more minor contribution, we also provide a non-asymptotic $\Omega(1/k^2)$ lower bound (independent of $n$) for cyclic gradient descent, where no random shuffling takes place.