gradient descent

The Glory of XGBoost


With so many machine learning algorithms out there, how do you choose the best one for your problem? The answer depends on the application and the data. Is it classification, regression, supervised, unsupervised, natural language processing, time series? There are many avenues to take, but in this article I am going to focus on one algorithm that I find particularly interesting: XGBoost. XGBoost stands for extreme gradient boosting and is an open source library that provides an efficient and effective implementation of gradient boosting.

Top 8 Deep Learning Concepts Every Data Science Professional Must Know


"Deep learning is making a good wave in delivering a solution to difficult problems that have been faced in the field of artificial intelligence (AI) for so many years, as quoted by Yann LeCun, Yoshua Bengio & Geoffrey Hinton." For data scientists to apply deep learning successfully, they must first understand the mathematics of modeling, choose the right algorithm to fit their model to the data, and settle on the right technique to implement it. To get you started, we have come up with a list of deep learning algorithms needed by every data science professional. The cost function used in a neural network is similar to the cost function used in any other machine learning model. It measures how good your neural network is by comparing the values it predicts with the actual values.
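As a rough illustration of what such a cost function computes, here is a minimal sketch in Python. The snippet above does not name a specific cost function, so mean squared error is an assumed choice here, and the function name and sample values are made up for the example.

```python
def mean_squared_error(predicted, actual):
    """Average squared difference between predictions and true values.

    Assumed cost function for illustration; the article does not name one.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical predictions from a model and the matching true values
predictions = [2.5, 0.0, 2.0]
targets = [3.0, -0.5, 2.0]
cost = mean_squared_error(predictions, targets)
```

A lower cost means the predictions sit closer to the actual values; training adjusts the model so that this number goes down.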

Gradient Descent for Machine Learning (ML) 101 with Python Tutorial


Gradient descent is one of the most common algorithms used in neural networks [7], data science, optimization, and machine learning tasks. The gradient descent algorithm and its variants can be found in almost every machine learning model. Gradient descent is a popular optimization method for tuning the parameters of a machine learning model. Its goal is to find the parameter values that minimize the error. It is mostly used to update the parameters of the model -- in this case, parameters refer to coefficients in regression and weights in a neural network.
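To make the parameter-update idea concrete, here is a minimal sketch of gradient descent fitting the two coefficients of a simple linear regression. It is not taken from the tutorial itself: the function name, the learning rate, the step count, and the toy data are all assumptions chosen for illustration, and mean squared error is assumed as the error being minimized.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y ~ w * x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0  # regression coefficients, started at zero
    n = len(xs)
    for _ in range(n_steps):
        # Prediction error of the current model on every sample
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. w and b
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Move each parameter a small step in the downhill direction
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical values)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
```

On this noise-free data the fitted `w` and `b` approach the slope 2 and intercept 1 used to generate it. A learning rate that is too large would make the updates diverge, which is why a small value is assumed here.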

The Connection Between Applied Mathematics and Deep Learning


In recent years, deep learning (DL) has inspired a myriad of advances within the scientific computing community. This subset of artificial intelligence relies on multiple components of applied mathematics, but what type of relationship do applied mathematicians have with DL? This question was the subject of a plenary talk by Yann LeCun (Facebook and New York University) at the virtual 2020 SIAM Conference on Mathematics of Data Science, which took place earlier this year. LeCun provided a brief history of machine learning (ML), highlighted the mathematical underpinnings of the field, presented both his vision and several broad open questions for ML's future, and discussed applied math's current relation and potential impending contributions. A 2018 SIAM News article by Gilbert Strang, entitled "The Functions of Deep Learning," offers an introduction for those who are unfamiliar with neural networks, ML, and DL.

Technical Perspective: Why Don't Today's Deep Nets Overfit to Their Training Data?

Communications of the ACM

The following article by Zhang et al. is well-known for having highlighted that the widespread success of deep learning in artificial intelligence brings with it a fundamental new theoretical challenge, specifically: Why don't today's deep nets overfit to training data? This question has come to animate the theory of deep learning. Let's understand this question in the context of supervised learning, where the machine's goal is to learn to provide labels to inputs (for example, learn to label cat pictures with "1" and dog pictures with "0"). Deep learning solves this task by training a net on a suitably large training set of images that have been labeled correctly by humans. The parameters of the net are randomly initialized and thereafter adjusted in many stages via the simplest algorithm imaginable: gradient descent on the current difference between desired output and actual output.

Black-box Adversarial Attacks in Autonomous Vehicle Technology

Artificial Intelligence

Despite the high-quality performance of deep neural networks in real-world applications, they are susceptible to the minor perturbations of adversarial attacks. These perturbations are mostly undetectable to human vision. The impact of such attacks has become extremely detrimental in autonomous vehicles with real-time "safety" concerns. Black-box adversarial attacks cause drastic misclassification of critical scene elements such as road signs and traffic lights, leading the autonomous vehicle to crash into other vehicles or pedestrians. In this paper, we propose a novel query-based attack method called Modified Simple black-box attack (M-SimBA) to overcome the use of a white-box source in transfer-based attack methods. Also, the issue of late convergence in the Simple black-box attack (SimBA) is addressed by minimizing the loss of the most confused class, which is the incorrect class predicted by the model with the highest probability, instead of trying to maximize the loss of the correct class. We evaluate the performance of the proposed approach on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. We show that the proposed model outperforms existing models such as Transfer-based projected gradient descent (T-PGD) and SimBA in terms of convergence time, flattening the distribution of confused class probability, and producing adversarial samples with least confidence on the true class.

Code Adam Gradient Descent Optimization From Scratch


Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function. A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent like AdaGrad and RMSProp update the algorithm to use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values. The Adaptive Moment Estimation algorithm, or Adam for short, is an extension to gradient descent and a natural successor to techniques like AdaGrad and RMSProp that automatically adapts a learning rate for each input variable of the objective function and further smooths the search process by using an exponentially decreasing moving average of the gradient to make updates to variables. In this tutorial, you will discover how to develop gradient descent with the Adam optimization algorithm from scratch.
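The Adam update described above can be sketched in a few lines of Python. This is not the tutorial's own code: the function and variable names, the two-variable bowl-shaped test objective, the starting point, and the hyperparameter values (step size 0.02, decay factors 0.9 and 0.999) are all assumptions made for illustration.

```python
import math


def adam(grad, x0, n_steps=1000, alpha=0.02, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a function given its gradient, using the Adam update rule."""
    x = list(x0)
    m = [0.0] * len(x)  # moving average of the gradient (first moment)
    v = [0.0] * len(x)  # moving average of the squared gradient (second moment)
    for t in range(1, n_steps + 1):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            # Bias correction: both averages start at zero and are too small early on
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            # Each variable gets its own effective step size
            x[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x


def bowl_gradient(x):
    """Gradient of the assumed test objective f(x, y) = x^2 + y^2."""
    return [2.0 * x[0], 2.0 * x[1]]


solution = adam(bowl_gradient, [-0.8, 0.9])
```

The test objective has its minimum at the origin, so `solution` should end up close to `[0, 0]`. Dividing by the square root of the second-moment average is what gives each input variable its own adaptive step size.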

Provable Generalization of SGD-trained Neural Networks of Any Width in the Presence of Adversarial Label Noise

Machine Learning

We consider a one-hidden-layer leaky ReLU network of arbitrary width trained by stochastic gradient descent following an arbitrary initialization. We prove that stochastic gradient descent (SGD) produces neural networks that have classification accuracy competitive with that of the best halfspace over the distribution for a broad class of distributions that includes log-concave isotropic and hard margin distributions. Equivalently, such networks can generalize when the data distribution is linearly separable but corrupted with adversarial label noise, despite the capacity to overfit. We conduct experiments which suggest that for some distributions our generalization bounds are nearly tight. This is the first result that shows that overparameterized neural networks trained by SGD can generalize when the data is corrupted with adversarial label noise.

Learning with Gradient Descent and Weakly Convex Losses

Machine Learning

We study the learning performance of gradient descent when the empirical risk is weakly convex, namely, the smallest negative eigenvalue of the empirical risk's Hessian is bounded in magnitude. By showing that this eigenvalue can control the stability of gradient descent, generalisation error bounds are proven that hold under a wider range of step sizes compared to previous work. Out of sample guarantees are then achieved by decomposing the test error into generalisation, optimisation and approximation errors, each of which can be bounded and traded off with respect to algorithmic parameters, sample size and magnitude of this eigenvalue. In the case of a two layer neural network, we demonstrate that the empirical risk can satisfy a notion of local weak convexity, specifically, the Hessian's smallest eigenvalue during training can be controlled by the normalisation of the layers, i.e., network scaling. This allows test error guarantees to then be achieved when the population risk minimiser satisfies a complexity assumption. By trading off the network complexity and scaling, insights are gained into the implicit bias of neural network scaling, which are further supported by experimental findings.

Beyond Procrustes: Balancing-Free Gradient Descent for Asymmetric Low-Rank Matrix Sensing

Machine Learning

Low-rank matrix estimation plays a central role in various applications across science and engineering. Recently, nonconvex formulations based on matrix factorization have been provably solved by simple gradient descent algorithms with strong computational and statistical guarantees. However, when the low-rank matrices are asymmetric, existing approaches rely on adding a regularization term to balance the scale of the two matrix factors, which in practice can be removed safely without hurting the performance when initialized via the spectral method. In this paper, we provide a theoretical justification for this in the matrix sensing problem, which aims to recover a low-rank matrix from a small number of linear measurements. As long as the measurement ensemble satisfies the restricted isometry property, gradient descent -- in conjunction with spectral initialization -- converges linearly without the need to explicitly promote balancedness of the factors; in fact, the factors stay balanced automatically throughout the execution of the algorithm. Our analysis tracks the evolution of a new distance metric that directly accounts for the ambiguity due to invertible transforms, and might be of independent interest.