Goto

Collaborating Authors

 Gradient Descent


Gradient descent with momentum --- to accelerate or to super-accelerate?

arXiv.org Machine Learning

We consider gradient descent with `momentum', a widely used method for loss function minimization in machine learning. This method is often used with `Nesterov acceleration', meaning that the gradient is evaluated not at the current position in parameter space, but at the estimated position after one step. In this work, we show that the algorithm can be improved by extending this `acceleration' --- by using the gradient at an estimated position several steps ahead rather than just one step ahead. How far one looks ahead in this `super-acceleration' algorithm is determined by a new hyperparameter. Considering a one-parameter quadratic loss function, the optimal value of the super-acceleration can be exactly calculated and analytically estimated. We show explicitly that super-accelerating the momentum algorithm is beneficial, not only for this idealized problem, but also for several synthetic loss landscapes and for the MNIST classification task with neural networks. Super-acceleration is also easy to incorporate into adaptive algorithms like RMSProp or Adam, and is shown to improve these algorithms.


A Better Bound Gives a Hundred Rounds: Enhanced Privacy Guarantees via $f$-Divergences

arXiv.org Machine Learning

We derive the optimal differential privacy (DP) parameters of a mechanism that satisfies a given level of R enyi differential privacy (RDP). Our result is based on the joint range of two f -divergences that underlie the approximate and the R enyi variations of differential privacy. We apply our result to the moments accountant framework for characterizing privacy guarantees of stochastic gradient descent. When compared to the state-of-the-art, our bounds may lead to about 100 more stochastic gradient descent iterations for training deep learning models for the same privacy budget. Differential privacy (DP) [1] has become the de facto standard for privacy-preserving data analytics. Intuitively, a (potentially randomized) algorithm is said to be differentially private if its output does not vary significantly with small perturbations of the input. DP guarantees are usually cast in terms of properties of the information density [2] of the output of the algorithm conditioned on a given input--referred to as the privacy loss variable in the DP literature.


Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent

arXiv.org Machine Learning

Machine learning has made tremendous progress in recent years, with models matching or even surpassing humans on a series of specialized tasks. One key element behind the progress of machine learning in recent years has been the ability to train machine learning models in large-scale distributed shared-memory and message-passing environments. Many of these models are trained employing variants of stochastic gradient descent (SGD) based optimization. In this paper, we introduce a general consistency condition covering communication-reduced and asynchronous distributed SGD implementations. Our framework, called elastic consistency enables us to derive convergence bounds for a variety of distributed SGD methods used in practice to train large-scale machine learning models. The proposed framework de-clutters the implementation-specific convergence analysis and provides an abstraction to derive convergence bounds. We utilize the framework to analyze a sparsification scheme for distributed SGD methods in an asynchronous setting for convex and non-convex objectives. We implement the distributed SGD variant to train deep CNN models in an asynchronous shared-memory setting. Empirical results show that error-feedback may not necessarily help in improving the convergence of sparsified asynchronous distributed SGD, which corroborates an insight suggested by our convergence analysis.


Adaptive fractional order graph neural network

#artificialintelligence

This paper proposes adaptive fractional order graph neural network (AFGNN), optimized by a time-varying fractional order gradient descent method to address the challenges of local optimum of classic and fractional GNNs which are specialised at aggregating information from the feature and adjacent matrices of connected nodes and their neighbours to solve learning tasks on non-Euclidean data such as graphs. To overcome the high computational complexity of fractional order derivations, the proposed model approximately calculates the fractional order gradients. We further prove such approximation is feasible and the AFGNN is unbiased. Extensive experiments on benchmark citation networks and object recognition challenges confirm the performance of AFGNN. The first group of experiments show that the results of AFGNN outperform the steepest gradient based method and conventional GNNs on the citation networks.


Learning a Single Neuron with Gradient Methods

arXiv.org Machine Learning

We consider the fundamental problem of learning a single neuron $x \mapsto\sigma(w^\top x)$ using standard gradient methods. As opposed to previous works, which considered specific (and not always realistic) input distributions and activation functions $\sigma(\cdot)$, we ask whether a more general result is attainable, under milder assumptions. On the one hand, we show that some assumptions on the distribution and the activation function are necessary. On the other hand, we prove positive guarantees under mild assumptions, which go beyond those studied in the literature so far. We also point out and study the challenges in further strengthening and generalizing our results.


On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width

arXiv.org Machine Learning

The Hessian of neural networks can be decomposed into a sum of two matrices: (i) the positive semidefinite generalized Gauss-Newton matrix G, and (ii) the matrix H containing negative eigenvalues. We observe that for wider networks, minimizing the loss with the gradient descent optimization maneuvers through surfaces of positive curvatures at the start and end of training, and close to zero curvatures in between. In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G. To explain this phenomenon, we show that when initialized using common methodologies, the gradients of over-parameterized networks are approximately orthogonal to H, such that the curvature of the loss surface is strictly positive in the direction of the gradient.


Stochastic Recursive Gradient Descent Ascent for Stochastic Nonconvex-Strongly-Concave Minimax Problems

arXiv.org Machine Learning

We consider nonconvex-concave minimax problems of the form $\min_{\bf x}\max_{\bf y} f({\bf x},{\bf y})$, where $f$ is strongly-concave in $\bf y$ but possibly nonconvex in $\bf x$. We focus on the stochastic setting, where we can only access an unbiased stochastic gradient estimate of $f$ at each iteration. This formulation includes many machine learning applications as special cases such as adversary training and certifying robustness in deep learning. We are interested in finding an ${\mathcal O}(\varepsilon)$-stationary point of the function $\Phi(\cdot)=\max_{\bf y} f(\cdot, {\bf y})$. The most popular algorithm to solve this problem is stochastic gradient decent ascent, which requires $\mathcal O(\kappa^3\varepsilon^{-4})$ stochastic gradient evaluations, where $\kappa$ is the condition number. In this paper, we propose a novel method called Stochastic Recursive gradiEnt Descent Ascent (SREDA), which estimates gradients more efficiently using variance reduction. This method achieves the best known stochastic gradient complexity of ${\mathcal O}(\kappa^3\varepsilon^{-3})$, and its dependency on $\varepsilon$ is optimal for this problem.


Choosing the Sample with Lowest Loss makes SGD Robust

arXiv.org Machine Learning

The presence of outliers can potentially significantly skew the parameters of machine learning models trained via stochastic gradient descent (SGD). In this paper we propose a simple variant of the simple SGD method: in each step, first choose a set of k samples, then from these choose the one with the smallest current loss, and do an SGD-like update with this chosen sample. Vanilla SGD corresponds to k = 1, i.e. no choice; k >= 2 represents a new algorithm that is however effectively minimizing a non-convex surrogate loss. Our main contribution is a theoretical analysis of the robustness properties of this idea for ML problems which are sums of convex losses; these are backed up with linear regression and small-scale neural network experiments


Adaptive Stopping Rule for Kernel-based Gradient Descent Algorithms

arXiv.org Machine Learning

In this paper, we propose an adaptive stopping rule for kernel-based gradient descent (KGD) algorithms. We introduce the empirical effective dimension to quantify the increments of iterations in KGD and derive an implementable early stopping strategy. We analyze the performance of the adaptive stopping rule in the framework of learning theory. Using the recently developed integral operator approach, we rigorously prove the opti-mality of the adaptive stopping rule in terms of showing the optimal learning rates for KGD equipped with this rule. Furthermore, a sharp bound on the number of iterations in KGD equipped with the proposed early stopping rule is also given to demonstrate its computational advantage. Introduction In financial studies, clinical medicine, gene analysis and engineering applications, data of input-output pairs are collected to pursue the relation between input and output. Kernel methods [14], which map input points from the input space to some feature space and then synthesize the estimator in the feature space, have been widely used for this purpose. The research was supported by the National Natural Science Foundation of China [Grant Nos. To overcome their computational and storage bottlenecks, numerous techniques (e,g, the distributed learning [41, 22], localized learning [27] and random sketching [17, 39]) have been further developed to equip kernel methods in the recent era of big data. KGD, as a popular realization of kernel methods, succeeds in avoiding the saturation of KRR in theory [16], reducing the computational burden of KPCA in computation [13], and benefiting in scalability when compared with KPLS and KCG [23]. Therefore, it has been widely used in regression [40], classification [37] and minimum error entropy principle analysis [21].


Backtracking Gradient Descent allowing unbounded learning rates

arXiv.org Machine Learning

In unconstrained optimisation on an Euclidean space, to prove convergence in Gradient Descent processes (GD) $x_{n+1}=x_n-\delta _n \nabla f(x_n)$ it usually is required that the learning rates $\delta _n$'s are bounded: $\delta _n\leq \delta $ for some positive $\delta $. Under this assumption, if the sequence $x_n$ converges to a critical point $z$, then with large values of $n$ the update will be small because $||x_{n+1}-x_n||\lesssim ||\nabla f(x_n)||$. This may also force the sequence to converge to a bad minimum. If we can allow, at least theoretically, that the learning rates $\delta _n$'s are not bounded, then we may have better convergence to better minima. A previous joint paper by the author showed convergence for the usual version of Backtracking GD under very general assumptions on the cost function $f$. In this paper, we allow the learning rates $\delta _n$ to be unbounded, in the sense that there is a function $h:(0,\infty)\rightarrow (0,\infty )$ such that $\lim _{t\rightarrow 0}th(t)=0$ and $\delta _n\lesssim \max \{h(x_n),\delta \}$ satisfies Armijo's condition for all $n$, and prove convergence under the same assumptions as in the mentioned paper. It will be shown that this growth rate of $h$ is best possible if one wants convergence of the sequence $\{x_n\}$. A specific way for choosing $\delta _n$ in a discrete way connects to Two-way Backtracking GD defined in the mentioned paper. We provide some results which either improve or are implicitly contained in those in the mentioned paper and another recent paper on avoidance of saddle points.