Gradient Descent
Online greedy identification of linear dynamical systems
Blanke, Matthieu, Lelarge, Marc
This work addresses the problem of exploration in an unknown environment. For linear dynamical systems, we use an experimental design framework and introduce an online greedy policy where the control maximizes the information of the next step. In a setting with a limited number of experimental trials, our algorithm has low complexity and shows experimentally competitive performances compared to more elaborate gradient-based methods.
Optimizer in Deep Learning
An optimizer is a function or algorithm that adjusts the attributes of a neural network, such as its weights and learning rate. In doing so, it helps decrease the overall loss and improve accuracy. Picking the ideal weights for a model is a daunting task, since a deep learning model usually has numerous parameters. This raises the need to pick an optimization algorithm suited to your application. Different optimizers change the weights and the learning rate in different ways.
What is momentum in a Neural network and how does it work?
In a neural network, the loss is used to measure performance. The higher the loss, the poorer the performance of the network, which is why we always try to minimize the loss. The process of minimizing the loss is called optimization. An optimizer is a method that modifies the weights of the neural network to reduce the loss. Although several neural network optimizers exist, in this article we will learn about gradient descent with momentum and compare its performance with others.
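As a rough illustration of the idea, here is a minimal sketch of gradient descent with momentum (the classical heavy-ball form). It is not taken from the article itself: the function name `gd_momentum`, the toy quadratic loss, and the hyperparameter values are all assumptions chosen for the example.

```python
def gd_momentum(grad, w0, lr=0.1, beta=0.9, steps=200):
    """Heavy-ball momentum: v <- beta * v + grad(w); w <- w - lr * v."""
    w = list(w0)
    v = [0.0] * len(w)  # velocity: a decaying sum of past gradients
    for _ in range(steps):
        g = grad(w)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w


# Hypothetical toy loss f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2), minimized at (0, 0).
def toy_grad(w):
    return [w[0], 10.0 * w[1]]


w_final = gd_momentum(toy_grad, [5.0, 5.0], lr=0.05, beta=0.9, steps=500)
print(w_final)  # both coordinates should be very close to 0
```

The velocity term lets consecutive gradients that point the same way reinforce each other, while gradients that flip sign partly cancel. Setting `beta=0.0` recovers plain gradient descent, which makes for an easy side-by-side comparison.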
Scalable Whitebox Attacks on Tree-based Models
Castiglione, Giuseppe, Ding, Gavin, Hashemi, Masoud, Srinivasa, Christopher, Wu, Ga
Adversarial robustness is one of the essential safety criteria for guaranteeing the reliability of machine learning models. While various adversarial robustness testing approaches were introduced in the last decade, we note that most of them are incompatible with non-differentiable models such as tree ensembles. Since tree ensembles are widely used in industry, this reveals a crucial gap between adversarial robustness research and practical applications. This paper proposes a novel whitebox adversarial robustness testing approach for tree ensemble models. Concretely, the proposed approach smooths the tree ensembles through temperature controlled sigmoid functions, which enables gradient descent-based adversarial attacks. By leveraging sampling and the log-derivative trick, the proposed approach can scale up to testing tasks that were previously unmanageable. We compare the approach against both random perturbations and blackbox approaches on multiple public datasets (and corresponding models). Our results show that the proposed method can 1) successfully reveal the adversarial vulnerability of tree ensemble models without causing computational pressure for testing and 2) flexibly balance the search performance and time complexity to meet various testing criteria.
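To make the smoothing idea concrete, here is a small sketch for a single decision stump. It only illustrates replacing a hard split with a temperature-controlled sigmoid so that a gradient with respect to the input exists. It is not the paper's formulation, and every name in it (`soft_stump`, `temperature`, the leaf values) is an assumption made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hard_stump(x, threshold, left_value, right_value):
    """Ordinary split: piecewise constant, so its gradient in x is zero almost everywhere."""
    return left_value if x <= threshold else right_value


def soft_stump(x, threshold, left_value, right_value, temperature):
    """Smoothed split: route to the left leaf with probability sigmoid((threshold - x) / temperature)."""
    p_left = sigmoid((threshold - x) / temperature)
    return p_left * left_value + (1.0 - p_left) * right_value


def soft_stump_grad(x, threshold, left_value, right_value, temperature):
    """Derivative of soft_stump with respect to the input x."""
    p_left = sigmoid((threshold - x) / temperature)
    # d p_left / dx = -p_left * (1 - p_left) / temperature
    return -(left_value - right_value) * p_left * (1.0 - p_left) / temperature


# A small temperature approaches the hard split; a larger one gives a smoother, more informative gradient.
print(soft_stump(0.0, 0.5, 1.0, -1.0, temperature=0.01))
print(soft_stump_grad(0.4, 0.5, 1.0, -1.0, temperature=0.1))
```

In this sketch, a gradient-based attack would move `x` along the sign of `soft_stump_grad` to push the output toward the other leaf, which is not possible with the hard split.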
Local optimisation of Nyström samples through stochastic gradient descent
Hutchings, Matthew, Gauthier, Bertrand
We study a relaxed version of the column-sampling problem for the Nyström approximation of kernel matrices, where approximations are defined from multisets of landmark points in the ambient space; such multisets are referred to as Nyström samples. We consider an unweighted variation of the radial squared-kernel discrepancy (SKD) criterion as a surrogate for the classical criteria used to assess the Nyström approximation accuracy; in this setting, we discuss how Nyström samples can be efficiently optimised through stochastic gradient descent. We perform numerical experiments which demonstrate that the local minimisation of the radial SKD yields Nyström samples with improved Nyström approximation accuracy.
Resonance in Weight Space: Covariate Shift Can Drive Divergence of SGD with Momentum
Banman, Kirby, Peet-Pare, Liam, Hegde, Nidhi, Fyshe, Alona, White, Martha
Most convergence guarantees for stochastic gradient descent with momentum (SGDm) rely on iid sampling. Yet, SGDm is often used outside this regime, in settings with temporally correlated input samples such as continual learning and reinforcement learning. Existing work has shown that SGDm with a decaying step-size can converge under Markovian temporal correlation. In this work, we show that SGDm under covariate shift with a fixed step-size can be unstable and diverge. In particular, we show SGDm under covariate shift is a parametric oscillator, and so can suffer from a phenomenon known as resonance. We approximate the learning system as a time varying system of ordinary differential equations, and leverage existing theory to characterize the system's divergence/convergence as resonant/nonresonant modes. The theoretical result is limited to the linear setting with periodic covariate shift, so we empirically supplement this result to show that resonance phenomena persist even under non-periodic covariate shift, nonlinear dynamics with neural networks, and optimizers other than SGDm.
Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings
Wang, Dongsheng, Guo, Dandan, Zhao, He, Zheng, Huangjie, Tanwisuth, Korawat, Chen, Bo, Zhou, Mingyuan
A topic model is often formulated as a generative model that explains how each word of a document is generated given a set of topics and document-specific topic proportions. It is focused on capturing the word co-occurrences in a document and hence often suffers from poor performance in analyzing short documents. In addition, its parameter estimation often relies on approximate posterior inference that is either not scalable or suffers from large approximation error. This paper introduces a new topic-modeling framework where each document is viewed as a set of word embedding vectors and each topic is modeled as an embedding vector in the same embedding space. Embedding the words and topics in the same vector space, we define a method to measure the semantic difference between the embedding vectors of the words of a document and those of the topics, and optimize the topic embeddings to minimize the expected difference over all documents. Experiments on text analysis demonstrate that the proposed method, which is amenable to mini-batch stochastic gradient descent based optimization and hence scalable to big corpora, provides competitive performance in discovering more coherent and diverse topics and extracting better document representations.
Linear Model the Machine Learning Way
The Ordinary Least Squares model (OLS) is a central building block in Machine Learning (ML). OLS is also used everywhere in the Social Sciences. I come from an Economics background and I was initially a bit puzzled by the way the ML textbooks solve OLS. In this blog post, I explain the Economics way versus the ML way and why both make sense. TL;DR: In a high-dimensional setting, do not invert a huge matrix; use gradient descent.
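The contrast between the two approaches can be sketched on a toy problem with one feature and an intercept. This is an illustrative example and not code from the blog post: the data, the function names, and the step size are assumptions. The closed-form route solves the normal equations X'X b = X'y directly, while the gradient descent route only ever evaluates the gradient of the mean squared error.

```python
# Toy data generated from y = 2 + 3x with no noise (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 + 3.0 * x for x in xs]


def ols_closed_form(xs, ys):
    """Solve the 2x2 normal equations for (intercept, slope)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X'X
    slope = (n * sxy - sx * sy) / det
    intercept = (sy - slope * sx) / n
    return intercept, slope


def ols_gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Minimize the mean squared error by gradient descent; no matrix is inverted."""
    n = len(xs)
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        residuals = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        g0 = 2.0 / n * sum(residuals)
        g1 = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1


b_closed = ols_closed_form(xs, ys)
b_gd = ols_gradient_descent(xs, ys)
print(b_closed, b_gd)  # both should be close to (2, 3)
```

With two parameters the closed form is obviously cheaper. The case for gradient descent appears when the number of features is large enough that forming and solving X'X becomes the bottleneck.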
Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime
Zou, Difan, Wu, Jingfeng, Braverman, Vladimir, Gu, Quanquan, Kakade, Sham M.
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most existing generalization analyses are made for single-pass SGD, which is a less practical variant compared to the commonly used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which may be too pessimistic to explain the superior generalization ability for some particular problem instance. The goal of this paper is to sharply characterize the generalization of multi-pass SGD, by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that SGD always performs worse, instance-wise, than GD, in generalization. On the other hand, we show that although SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer stochastic gradient evaluations, and therefore is preferable in terms of computational time.