 Gradient Descent


I wasn't getting hired as a Data Scientist. So I sought data on who is.

#artificialintelligence

At the time I'm writing this, every single trending article on my Towards Data Science home page is about applying or learning a particular skill in data science. At the top are big-picture skills such as How to Work With Stakeholders as a Data Scientist and How to Become a Data Engineer, followed by a litany of very specific skills, including technical primers on Batch Gradient Descent vs. Stochastic Gradient Descent, Multi-Class Text Classification, Faster R-CNN, et cetera. As a dedicated Medium platform for "sharing concepts, ideas, and codes" in data science, it is not surprising that such learning resources are highly popular among Towards Data Science followers, who are probably navigating data-centric projects and professions. But to a novice looking to prioritize what is essential, it can quickly become daunting. Should one train to become a master Kaggler?


An Implicit Form of Krasulina's k-PCA Update without the Orthonormality Constraint

arXiv.org Machine Learning

We offer new insights into the two commonly used updates for the online $k$-PCA problem, namely, Krasulina's and Oja's updates. We show that Krasulina's update corresponds to a projected gradient descent step on the Stiefel manifold of the orthonormal $k$-frames, while Oja's update amounts to a gradient descent step using the unprojected gradient. Following these observations, we derive a more \emph{implicit} form of Krasulina's $k$-PCA update, i.e., a version that uses the information of the future gradient as much as possible. Most interestingly, our implicit Krasulina update avoids the costly QR-decomposition step by bypassing the orthonormality constraint. We show that the new update in fact corresponds to an online EM step applied to a probabilistic $k$-PCA model. The probabilistic view of the updates allows us to combine multiple models in a distributed setting. We show experimentally that the implicit Krasulina update yields superior convergence while being significantly faster. We also give strong evidence that the new update can benefit from parallelism and is more stable w.r.t. tuning of the learning rate.
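
For orientation, the following is a minimal sketch of the classical Oja update in the simplest case $k = 1$, where the re-orthonormalization (a QR decomposition for general $k$) reduces to normalizing a single vector. It is not the implicit Krasulina update proposed in the paper; the function name `oja_step` and the learning rate are illustrative assumptions.

```python
import math
import random


def oja_step(w, x, lr):
    """One Oja update for k = 1 (illustrative sketch, not the paper's method).

    Takes a step along the unprojected gradient x * (x^T w), then restores
    unit norm. For k > 1 the normalization would be a QR decomposition.
    """
    proj = sum(wi * xi for wi, xi in zip(w, x))          # x^T w
    w = [wi + lr * proj * xi for wi, xi in zip(w, x)]    # w + lr * x x^T w
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


# Usage: stream samples whose variance is largest along the first axis.
rng = random.Random(0)
w = [1 / math.sqrt(2), 1 / math.sqrt(2)]
for _ in range(2000):
    sample = [3.0 * rng.gauss(0, 1), 0.3 * rng.gauss(0, 1)]
    w = oja_step(w, sample, lr=0.01)
```

After the loop, `w` should point approximately along the first coordinate axis (up to sign), which is the top principal direction of the synthetic stream.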


Better Communication Complexity for Local SGD

arXiv.org Machine Learning

We revisit the local Stochastic Gradient Descent (local SGD) method and prove new convergence rates. We close the gap in the theory by showing that it works under unbounded gradients and extend its convergence to weakly convex functions. Furthermore, by changing the assumptions, we manage to get new bounds that explain in what regimes local SGD is faster than its non-local version. For instance, if the objective is strongly convex, we show that, up to constants, it is sufficient to synchronize $M$ times in total, where $M$ is the number of nodes. This improves upon the known requirement of Stich (2018) of $\sqrt{TM}$ synchronization times in total, where $T$ is the total number of iterations, which helps to explain the empirical success of local SGD.
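
As a rough illustration of the method being analyzed, the sketch below runs local SGD on a toy problem: each of $M$ workers takes noisy gradient steps on its own copy of the iterate, and the copies are averaged only every `sync_every` steps. The one-dimensional quadratic objectives, the Gaussian gradient noise, and all names and parameters are assumptions made for the example, not details from the paper.

```python
import random


def local_sgd(targets, total_steps, sync_every, lr=0.1, noise=0.0, seed=0):
    """Toy local SGD: worker i minimizes 0.5 * (x - targets[i])**2.

    Returns the last averaged iterate and the number of synchronizations.
    """
    rng = random.Random(seed)
    x_server = 0.0
    workers = [x_server] * len(targets)
    syncs = 0
    for step in range(1, total_steps + 1):
        # Each worker takes one local stochastic gradient step.
        for i, a in enumerate(targets):
            grad = (workers[i] - a) + rng.gauss(0.0, noise)
            workers[i] -= lr * grad
        # Communicate only every `sync_every` steps: average and broadcast.
        if step % sync_every == 0:
            x_server = sum(workers) / len(workers)
            workers = [x_server] * len(workers)
            syncs += 1
    return x_server, syncs


x, syncs = local_sgd([1.0, 2.0, 3.0, 4.0], total_steps=200, sync_every=5)
```

Here 200 iterations cost only 40 communication rounds. The averaged iterate reaches the minimizer of the average objective in this toy case because all workers share the same curvature; that should not be expected of local methods in general.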


Gradient Descent with Compressed Iterates

arXiv.org Machine Learning

We propose and analyze a new type of stochastic first order method: gradient descent with compressed iterates (GDCI). GDCI in each iteration first compresses the current iterate using a lossy randomized compression technique, and subsequently takes a gradient step. This method is a distillation of a key ingredient in the current practice of federated learning, where a model needs to be compressed by a mobile device before it is sent back to a server for aggregation. Our analysis provides a step towards closing the gap between the theory and practice of federated learning, and opens the possibility for many extensions.
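
The two-stage iteration described above can be sketched as follows. The compressor here is stochastic rounding to a grid, chosen only because it is a simple lossy, randomized, unbiased operator; the paper's compressor class may differ. Evaluating the gradient at the compressed iterate is likewise one reading of the abstract, and all names are hypothetical.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Lossy randomized compressor (assumed for illustration): round each
    coordinate to a neighboring multiple of `step`, unbiased in expectation."""
    out = []
    for v in x:
        low = math.floor(v / step)
        frac = v / step - low  # fractional position between grid points
        out.append((low + (1 if rng.random() < frac else 0)) * step)
    return out


def gdci(grad, x0, lr, step, iters, seed=0):
    """Sketch of gradient descent with compressed iterates."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        xc = stochastic_round(x, step, rng)          # compress the iterate
        g = grad(xc)                                 # gradient at compressed point
        x = [c - lr * gi for c, gi in zip(xc, g)]    # gradient step
    return x


target = [1.234, -0.567]
quad_grad = lambda x: [xi - t for xi, t in zip(x, target)]
x_final = gdci(quad_grad, [0.0, 0.0], lr=0.5, step=0.01, iters=100)
```

On this quadratic the iterate settles in a neighborhood of the minimizer whose size is governed by the grid spacing `step`, which reflects the price paid for lossy compression.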


First Analysis of Local GD on Heterogeneous Data

arXiv.org Machine Learning

We provide the first convergence analysis of local gradient descent for minimizing the average of smooth and convex but otherwise arbitrary functions. Problems of this form and local gradient descent as a solution method are of importance in federated learning, where each function is based on private data stored by a user on a mobile device, and the data of different users can be arbitrarily heterogeneous. We show that in a low accuracy regime, the method has the same communication complexity as gradient descent.


Byzantine-Resilient Stochastic Gradient Descent for Distributed Learning: A Lipschitz-Inspired Coordinate-wise Median Approach

arXiv.org Machine Learning

In this work, we consider the resilience of distributed algorithms based on stochastic gradient descent (SGD) in distributed learning with potentially Byzantine attackers, who could send arbitrary information to the parameter server to disrupt the training process. Toward this end, we propose a new Lipschitz-inspired coordinate-wise median approach (LICM-SGD) to mitigate Byzantine attacks. We show that our LICM-SGD algorithm can resist up to half of the workers being Byzantine attackers, while still converging almost surely to a stationary region in non-convex settings. Also, our LICM-SGD method does not require any information about the number of attackers and the Lipschitz constant, which makes it attractive for practical implementations. Moreover, our LICM-SGD method enjoys the optimal $O(md)$ computational time-complexity in the sense that the time-complexity is the same as that of the standard SGD under no attacks. We conduct extensive experiments to show that our LICM-SGD algorithm consistently outperforms existing methods in training multi-class logistic regression and convolutional neural networks with MNIST and CIFAR-10 datasets. In our experiments, LICM-SGD also achieves a much faster running time thanks to its low computational time-complexity. Fueled by the rise of machine learning and big data analytics, recent years have witnessed an ever-increasing interest in solving large-scale empirical risk minimization problems (ERM) - a fundamental optimization problem that underpins a wide range of machine learning applications. In the post-Moore's-Law era, however, to sustain the rapidly growing computational power needs for solving large-scale ERM, the only viable solution is to exploit parallelism at and across different spatial scales. Indeed, the recent success of machine learning applications is due in large part to the use of distributed machine learning frameworks (e.g., TensorFlow [1] and others) which exploit the abundance of distributed CPU/GPU resources in large-scale computing clusters.
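
To make the aggregation idea concrete, here is a minimal sketch of plain coordinate-wise median aggregation at a parameter server, contrasted with simple averaging. It deliberately omits the Lipschitz-inspired selection step that distinguishes LICM-SGD, so it should be read as the generic building block rather than the paper's algorithm; the function names and the sample gradients are made up for the example.

```python
from statistics import mean, median


def coordinate_wise_median(gradients):
    """Aggregate worker gradients by taking the median of each coordinate."""
    return [median(coord) for coord in zip(*gradients)]


def coordinate_wise_mean(gradients):
    """Standard averaging, shown only for contrast."""
    return [mean(coord) for coord in zip(*gradients)]


# Three honest workers and two Byzantine workers sending arbitrary values.
honest = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1]]
byzantine = [[1e6, 1e6], [-1e6, 1e6]]

robust = coordinate_wise_median(honest + byzantine)
naive = coordinate_wise_mean(honest + byzantine)
```

With a minority of attackers, each coordinate of the median stays within the range of the honest values, whereas the mean can be dragged arbitrarily far by a single extreme report.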


Communication-Censored Distributed Stochastic Gradient Descent

arXiv.org Machine Learning

This paper develops a communication-efficient algorithm to solve the stochastic optimization problem defined over a distributed network, aiming at reducing the burdensome communication in applications such as distributed machine learning. Different from the existing works based on quantization and sparsification, we introduce a communication-censoring technique to reduce the transmissions of variables, which leads to our communication-Censored distributed Stochastic Gradient Descent (CSGD) algorithm. Specifically, in CSGD, the latest mini-batch stochastic gradient at a worker will be transmitted to the server only if it is sufficiently informative. When the latest gradient is not available, the stale one will be reused at the server. To implement this communication-censoring strategy, the batch sizes are increased to alleviate the effect of gradient noise. Theoretically, CSGD enjoys the same order of convergence rate as that of SGD, but effectively reduces communication. Numerical experiments further demonstrate the sizable communication saving of CSGD.
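
A censoring rule of this kind might look like the sketch below. The abstract does not state how "sufficiently informative" is measured, so the criterion used here (squared distance from the last transmitted gradient exceeding a fixed threshold) is an assumption, as are the function name and the threshold value; the increasing batch-size schedule is not modeled.

```python
def censor(new_grad, last_sent, threshold):
    """Decide whether a worker transmits its latest gradient.

    Assumed rule: transmit only if the new gradient differs enough from the
    one the server already holds. Returns (gradient the server will use,
    whether a transmission happened).
    """
    diff_sq = sum((n - s) ** 2 for n, s in zip(new_grad, last_sent))
    if diff_sq > threshold:
        return list(new_grad), True    # informative: send the fresh gradient
    return list(last_sent), False      # censored: server reuses the stale one


stale = [0.9, 0.0]
used_small, sent_small = censor([1.0, 0.0], stale, threshold=0.1)
used_large, sent_large = censor([2.0, 0.0], stale, threshold=0.1)
```

In the first call the change is too small to justify communication, so the server keeps the stale gradient; in the second the fresh gradient is transmitted.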


A Stochastic Quasi-Newton Method with Nesterov's Accelerated Gradient

arXiv.org Machine Learning

Incorporating second order curvature information in gradient based methods has been shown to improve convergence drastically despite its computational intensity. In this paper, we propose a stochastic (online) quasi-Newton method with Nesterov's accelerated gradient in both its full and limited memory forms for solving large scale non-convex optimization problems in neural networks. The performance of the proposed algorithm is evaluated in Tensorflow on benchmark classification and regression problems. The results show improved performance compared to the classical second order oBFGS and oLBFGS methods and popular first order stochastic methods such as SGD and Adam. The performance with different momentum rates and batch sizes is also illustrated.

Keywords: Neural networks · stochastic method · online training · Nesterov's accelerated gradient · quasi-Newton method · limited memory · Tensorflow

1 Introduction

Neural networks have been shown to be effective in numerous real-world applications.



Introduction to Online Convex Optimization

arXiv.org Machine Learning

It was written as an advanced text to serve as a basis for a graduate course, and/or as a reference to the researcher diving into this fascinating world at the intersection of optimization and machine learning. Such a course was given at the Technion in the years 2010-2014 with slight variations from year to year, and later at Princeton University in the years 2015-2016. The core material in these courses is fully covered in this book, along with exercises that allow the students to complete parts of proofs, or that were found illuminating and thought-provoking. Most of the material is given with examples of applications, which are interlaced throughout different topics. These include prediction from expert advice, portfolio selection, matrix completion and recommendation systems, SVM training and more.