AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

GRAFFL: Gradient-free Federated Learning of a Bayesian Generative Model

arXiv.org Machine LearningAug-29-2020

Federated learning platforms are gaining popularity. One of the major benefits is to mitigate the privacy risks as the learning of algorithms can be achieved without collecting or sharing data. While federated learning (i.e., many based on stochastic gradient algorithms) has shown great promise, there are still many challenging problems in protecting privacy, especially during the process of gradients update and exchange. This paper presents the first gradient-free federated learning framework called GRAFFL for learning a Bayesian generative model based on approximate Bayesian computation. Unlike conventional federated learning algorithms based on gradients, our framework does not require to disassemble a model (i.e., to linear components) or to perturb data (or encryption of data for aggregation) to preserve privacy. Instead, this framework uses implicit information derived from each participating institution to learn posterior distributions of parameters. The implicit information is summary statistics derived from SuffiAE that is a neural network developed in this study to create compressed and linearly separable representations thereby protecting sensitive information from leakage. As a sufficient dimensionality reduction technique, this is proved to provide sufficient summary statistics. We propose the GRAFFL-based Bayesian Gaussian mixture model to serve as a proof-of-concept of the framework. Using several datasets, we demonstrated the feasibility and usefulness of our model in terms of privacy protection and prediction performance (i.e., close to an ideal setting). The trained model as a quasi-global model can generate informative samples involving information from other institutions and enhances data analysis of each institution.

artificial intelligence, machine learning, summary statistics, (16 more...)

arXiv.org Machine Learning

2008.12925

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Virginia (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.36)

Add feedback

ROOT-SGD: Sharp Nonasymptotics and Asymptotic Efficiency in a Single Algorithm

Li, Chris Junchi, Mou, Wenlong, Wainwright, Martin J., Jordan, Michael I.

arXiv.org Machine LearningAug-28-2020

The theory and practice of stochastic optimization has focused on stochastic gradient descent (SGD) in recent years, retaining the basic first-order stochastic nature of SGD while aiming to improve it via mechanisms such as averaging, momentum, and variance reduction. Improvement can be measured along various dimensions, however, and it has proved difficult to achieve improvements both in terms of nonasymptotic measures of convergence rate and asymptotic measures of distributional tightness. In this work, we consider first-order stochastic optimization from a general statistical point of view, motivating a specific form of recursive averaging of past stochastic gradients. The resulting algorithm, which we refer to as \emph{Recursive One-Over-T SGD} (ROOT-SGD), matches the state-of-the-art convergence rate among online variance-reduced stochastic approximation methods. Moreover, under slightly stronger distributional assumptions, the rescaled last-iterate of ROOT-SGD converges to a zero-mean Gaussian distribution that achieves near-optimal covariance.

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Machine Learning

2008.1269

Country:

Asia > Middle East > Jordan (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report (0.50)
Workflow (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.77)

Add feedback

Predicting Training Time Without Training

Zancato, Luca, Achille, Alessandro, Ravichandran, Avinash, Bhotika, Rahul, Soatto, Stefano

arXiv.org Machine LearningAug-28-2020

We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function. To do so, we leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model. This allows us to approximate the training loss and accuracy at any point during training by solving a low-dimensional Stochastic Differential Equation (SDE) in function space. Using this result, we are able to predict the time it takes for Stochastic Gradient Descent (SGD) to fine-tune a model to a given loss without having to perform any training. In our experiments, we are able to predict training time of a ResNet within a 20% error margin on a variety of datasets and hyper-parameters, at a 30 to 45-fold reduction in cost compared to actual training. We also discuss how to further reduce the computational and memory cost of our method, and in particular we show that by exploiting the spectral properties of the gradients' matrix it is possible predict training time on a large dataset while processing only a subset of the samples.

artificial intelligence, machine learning, training time, (17 more...)

arXiv.org Machine Learning

2008.12478

Country:

North America > Canada > Ontario > Toronto (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > France (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Add feedback

Understanding and Detecting Convergence for Stochastic Gradient Descent with Momentum

Chee, Jerry, Li, Ping

arXiv.org Machine LearningAug-27-2020

Convergence detection of iterative stochastic optimization methods is of great practical interest. This paper considers stochastic gradient descent (SGD) with a constant learning rate and momentum. We show that there exists a transient phase in which iterates move towards a region of interest, and a stationary phase in which iterates remain bounded in that region around a minimum point. We construct a statistical diagnostic test for convergence to the stationary phase using the inner product between successive gradients and demonstrate that the proposed diagnostic works well. We theoretically and empirically characterize how momentum can affect the test statistic of the diagnostic, and how the test statistic captures a relatively sparse signal within the gradients in convergence. Finally, we demonstrate an application to automatically tune the learning rate by reducing it each time stationarity is detected, and show the procedure is robust to mis-specified initial rates.

artificial intelligence, machine learning, stationary phase, (16 more...)

arXiv.org Machine Learning

2008.12224

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(12 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Noise-induced degeneration in online learning

Sato, Yuzuru, Tsutsui, Daiji, Fujiwara, Akio

arXiv.org Machine LearningAug-27-2020

The gradient descent is the simplest optimisation algorithm represented by gradient dynamics in a potential. When the input data is finite, gradient descent dynamics fluctuates due to the finite size effects, and is called stochastic gradient descent. In this paper, we study stability of stochastic gradient descent dynamics from the viewpoint of dynamical systems theory. Learning is characterised as nonautonomous dynamics driven by uncertain input from the external, and as multi-scale dynamics which consists of slow memory dynamics and fast system dynamics. When the uncertain input sequences are modelled by stochastic processes, dynamics of learning is described by a random dynamical system. In contrast to the traditional Fokker-Planck approaches [5, 15], the random dynamical system approaches enable the study not only of stationary distributions and global statistics, but also of the pathwise structure of stochastic dynamics. Based on nonautonomous and random dynamical system theory, it is possible to analyse stability and bifurcation in machine learning.

artificial intelligence, gradient descent, machine learning, (16 more...)

arXiv.org Machine Learning

2008.10498

Country:

Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.05)
Asia > Japan > Hokkaidō > Hokkaidō Prefecture > Sapporo (0.04)

Genre: Research Report (0.64)

Industry: Education > Educational Setting > Online (0.42)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Blind Descent: A Prequel to Gradient Descent

Gupta, Akshat, R, Prasad N

arXiv.org Machine LearningAug-26-2020

We describe an alternative learning method for neural networks, which we call Blind Descent. By design, Blind Descent does not face problems like exploding or vanishing gradients. In Blind Descent, gradients are not used to guide the learning process. In this paper, we present Blind Descent as a more fundamental learning process compared to gradient descent. We also show that gradient descent can be seen as a specific case of the Blind Descent algorithm. We also train two neural network architectures, a multilayer perceptron and a convolutional neural network, using the most general Blind Descent algorithm to demonstrate a proof of concept.

artificial intelligence, blind descent, machine learning, (12 more...)

arXiv.org Machine Learning

2006.11505

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.56)

Add feedback

Deep Networks and the Multiple Manifold Problem

Buchanan, Sam, Gilboa, Dar, Wright, John

arXiv.org Machine LearningAug-25-2020

We study the multiple manifold problem, a binary classification task modeled on applications in machine vision, in which a deep fully-connected neural network is trained to separate two low-dimensional submanifolds of the unit sphere. We provide an analysis of the one-dimensional case, proving for a simple manifold configuration that when the network depth $L$ is large relative to certain geometric and statistical properties of the data, the network width $n$ grows as a sufficiently large polynomial in $L$, and the number of i.i.d. samples from the manifolds is polynomial in $L$, randomly-initialized gradient descent rapidly learns to classify the two manifolds perfectly with high probability. Our analysis demonstrates concrete benefits of depth and width in the context of a practically-motivated model problem: the depth acts as a fitting resource, with larger depths corresponding to smoother networks that can more readily separate the class manifolds, and the width acts as a statistical resource, enabling concentration of the randomly-initialized network and its gradients. The argument centers around the neural tangent kernel and its role in the nonasymptotic analysis of training overparameterized neural networks; to this literature, we contribute essentially optimal rates of concentration for the neural tangent kernel of deep fully-connected networks, requiring width $n \gtrsim L\,\mathrm{poly}(d_0)$ to achieve uniform concentration of the initial kernel over a $d_0$-dimensional submanifold of the unit sphere $\mathbb{S}^{n_0-1}$, and a nonasymptotic framework for establishing generalization of networks trained in the NTK regime with structured data. The proof makes heavy use of martingale concentration to optimally treat statistical dependencies across layers of the initial random network. This approach should be of use in establishing similar results for other network architectures.

artificial intelligence, machine learning, order statistics, (19 more...)

arXiv.org Machine Learning

2008.11245

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > India (0.04)
(2 more...)

Genre: Research Report (0.81)

Industry: Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Add feedback

Stochastic Markov Gradient Descent and Training Low-Bit Neural Networks

Ashbrock, Jonathan, Powell, Alexander M.

arXiv.org Machine LearningAug-25-2020

The massive size of modern neural networks has motivated substantial recent interest in neural network quantization. We introduce Stochastic Markov Gradient Descent (SMGD), a discrete optimization method applicable to training quantized neural networks. The SMGD algorithm is designed for settings where memory is highly constrained during training. We provide theoretical guarantees of algorithm performance as well as encouraging numerical results.

artificial intelligence, machine learning, neural network, (15 more...)

arXiv.org Machine Learning

2008.11117

Country:

North America > United States > Tennessee > Davidson County > Nashville (0.04)
Asia > Taiwan > Taiwan Province > Taipei (0.04)

Genre: Research Report (0.50)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Channel-Directed Gradients for Optimization of Convolutional Neural Networks

Lao, Dong, Zhu, Peihao, Wonka, Peter, Sundaramoorthi, Ganesh

arXiv.org Artificial IntelligenceAug-24-2020

We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error. The method requires only simple processing of existing stochastic gradients, can be used in conjunction with any optimizer, and has only a linear overhead (in the number of parameters) compared to computation of the stochastic gradient. We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental. We present the continuum theory of such gradients, its discretization, and application to deep networks. Experiments on benchmark datasets, several networks and baseline optimizers show that optimizers can be improved in generalization error by simply computing the stochastic gradient with respect to output-channel directed metrics. Stochastic gradient descent (SGD) is currently the dominant algorithm for optimizing large-scale convolutional neural networks (CNNs) LeCun et al. (1998); Simonyan & Zisserman (2014); He et al. (2016b). Although there has been large activity in optimization methods seeking to improve performance, SGD still dominates in large-scale CNN optimization in terms of its generalization ability. Despite SGD's dominance, there is still often a gap between training and real-world test accuracy performance in applications, which necessitates research in optimization methods to increase generalization accuracy.

artificial intelligence, deep learning, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2008.10766

Country: Asia > Middle East > Saudi Arabia > Mecca Province > Thuwal (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Periodic Stochastic Gradient Descent with Momentum for Decentralized Training

Gao, Hongchang, Huang, Heng

arXiv.org Machine LearningAug-24-2020

Decentralized training has been actively studied in recent years. Although a wide variety of methods have been proposed, yet the decentralized momentum SGD method is still underexplored. In this paper, we propose a novel periodic decentralized momentum SGD method, which employs the momentum schema and periodic communication for decentralized training. With these two strategies, as well as the topology of the decentralized training system, the theoretical convergence analysis of our proposed method is difficult. We address this challenging problem and provide the condition under which our proposed method can achieve the linear speedup regarding the number of workers. Furthermore, we also introduce a communication-efficient variant to reduce the communication cost in each communication round. The condition for achieving the linear speedup is also provided for this variant. To the best of our knowledge, these two methods are all the first ones achieving these theoretical results in their corresponding domain. We conduct extensive experiments to verify the performance of our proposed two methods, and both of them have shown superior performance over existing methods.

communication, communication cost, momentum sgd, (13 more...)

arXiv.org Machine Learning

2008.10435

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)

Add feedback