AITopics | Gradient Descent

Gradient Descent

Heavy-tailed Streaming Statistical Estimation

Tsai, Che-Ping, Prasad, Adarsh, Balakrishnan, Sivaraman, Ravikumar, Pradeep

arXiv.org Machine Learning, Aug-25-2021

We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples. This could also be viewed as stochastic optimization under heavy-tailed distributions, with an additional $O(p)$ space complexity constraint. We design a clipped stochastic gradient descent algorithm and provide an improved analysis, under a more nuanced condition on the noise of the stochastic gradients, which we show is critical when analyzing stochastic optimization problems arising from general statistical estimation problems. Our results guarantee convergence not just in expectation but with exponential concentration, and moreover do so using $O(1)$ batch size. We provide consequences of our results for mean estimation and linear regression. Finally, we provide empirical corroboration of our results and algorithms via synthetic experiments for mean estimation and linear regression.
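To make the idea in the abstract above concrete, here is a minimal sketch of clipped SGD for streaming mean estimation. It is not the authors' algorithm: the clipping threshold, the `1/t` step size, and the synthetic Pareto-type noise generator are all assumptions chosen for illustration. It does keep only a single $p$-dimensional vector in memory and uses one sample per step.

```python
import math
import random

def clip(vec, threshold):
    """Rescale vec so that its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= threshold or norm == 0.0:
        return list(vec)
    scale = threshold / norm
    return [v * scale for v in vec]

def clipped_sgd_mean(stream, dim, threshold, step=lambda t: 1.0 / t):
    """Estimate a mean from a stream with O(dim) memory and batch size 1.

    threshold and step are illustrative choices, not values from the paper.
    """
    theta = [0.0] * dim
    for t, x in enumerate(stream, start=1):
        # Stochastic gradient of 0.5 * ||theta - x||^2 with respect to theta.
        grad = [th - xi for th, xi in zip(theta, x)]
        clipped = clip(grad, threshold)
        eta = step(t)
        theta = [th - eta * g for th, g in zip(theta, clipped)]
    return theta

def heavy_tailed_stream(n, mean, rng):
    """Hypothetical test data: the mean plus symmetric Pareto-type noise."""
    for _ in range(n):
        yield [m + rng.choice((-1, 1)) * (rng.paretovariate(2.5) - 1.0)
               for m in mean]

rng = random.Random(0)
true_mean = [1.0, -2.0, 0.5]
estimate = clipped_sgd_mean(heavy_tailed_stream(20000, true_mean, rng),
                            dim=3, threshold=5.0)
print(estimate)
```

Clipping bounds the influence of any single extreme sample on the running estimate, which is the intuition behind using it with heavy-tailed data.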

algorithm, gradient, regression, (13 more...)

2108.11483

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.76)

Gradient Descent

#artificialintelligence, Aug-24-2021, 07:25:17 GMT

Today we are talking about gradient descent. First, what is gradient descent? Gradient descent is an iterative optimization algorithm used in machine learning to minimize a loss function. The loss function measures how well the model performs given the current set of parameters (weights and biases), and gradient descent is used to find the set of parameters that minimizes it. We use gradient descent to update the parameters of our model.
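As a minimal sketch of the update described above, the following fits a line `y = w * x + b` by gradient descent on the mean squared error. The data, the learning rate, and the number of steps are assumptions picked for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(w, b)
```

A learning rate that is too large makes the iterates diverge, while one that is too small makes convergence slow, so in practice it is tuned per problem.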

gradient descent, learning rate, local minimum, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification

Bénédict, Gabriel, Koops, Vincent, Odijk, Daan, de Rijke, Maarten

arXiv.org Machine Learning, Aug-24-2021

Multiclass multilabel classification refers to the task of attributing multiple labels to examples via predictions. Current models formulate a reduction of that multilabel setting into either multiple binary classifications or multiclass classification, allowing for the use of existing loss functions (sigmoid, cross-entropy, logistic, etc.). Empirically, these methods have been reported to achieve good performance on different metrics (F1 score, Recall, Precision, etc.). Theoretically though, the multilabel classification reductions do not accommodate the prediction of varying numbers of labels per example, and the underlying losses are distant estimates of the performance metrics. We propose a loss function, sigmoidF1. It is an approximation of the F1 score that (I) is smooth and tractable for stochastic gradient descent, (II) naturally approximates a multilabel metric, (III) estimates label propensities and label counts. More generally, we show that any confusion matrix metric can be formulated with a smooth surrogate. We evaluate the proposed loss function on different text and image datasets, and with a variety of metrics, to account for the complexity of multilabel classification evaluation. In our experiments, we embed the sigmoidF1 loss in a classification head that is attached to state-of-the-art efficient pretrained neural networks MobileNetV2 and DistilBERT. Our experiments show that sigmoidF1 outperforms other loss functions on four datasets and several metrics. These results show the effectiveness of using inference-time metrics as loss functions at training time in general and their potential on non-trivial classification problems like multilabel classification.
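One way to build a smooth F1 surrogate of the kind described above is to replace hard 0/1 predictions with sigmoid outputs when counting true positives, false positives, and false negatives. The sketch below follows that recipe. The slope and offset parameters `beta` and `eta`, the batch-level aggregation, and the small constant in the denominator are assumptions for illustration, and the paper's exact formulation may differ.

```python
import math

def sigmoid(u, beta=1.0, eta=0.0):
    """Sigmoid with an assumed slope beta and offset eta."""
    return 1.0 / (1.0 + math.exp(-beta * (u + eta)))

def sigmoid_f1_loss(logits, targets, beta=1.0, eta=0.0):
    """Smooth F1 surrogate loss over a batch.

    logits and targets are lists of rows (examples x labels),
    with targets in {0, 1}. Returns 1 minus the soft F1 score.
    """
    tp = fp = fn = 0.0
    for row_logits, row_targets in zip(logits, targets):
        for u, y in zip(row_logits, row_targets):
            s = sigmoid(u, beta, eta)
            tp += s * y                # soft true positives
            fp += s * (1 - y)          # soft false positives
            fn += (1.0 - s) * y        # soft false negatives
    soft_f1 = 2.0 * tp / (2.0 * tp + fn + fp + 1e-16)
    return 1.0 - soft_f1

targets = [[1, 0], [0, 1]]
good_logits = [[10.0, -10.0], [-10.0, 10.0]]
bad_logits = [[-10.0, 10.0], [10.0, -10.0]]
print(sigmoid_f1_loss(good_logits, targets))
print(sigmoid_f1_loss(bad_logits, targets))
```

Because every term is a smooth function of the logits, the loss has useful gradients everywhere, unlike the F1 score computed on thresholded predictions.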

classification, dataset, loss function, (14 more...)

2108.10566

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Netherlands > North Holland > Amsterdam (0.04)
North America > United States > New York > New York County > New York City (0.04)
(8 more...)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Gradient Descent: Design Your First Machine Learning Model

#artificialintelligence, Aug-23-2021, 12:10:36 GMT

Gradient descent is an optimization algorithm that is used to train machine learning models and is now widely used to train neural networks. Training data helps the model learn over time, as gradient descent acts as an automatic system that tunes parameters to achieve better results. These parameters are updated after each iteration until the function achieves the smallest possible error. The red arrow in the figure below is a gradient, and by updating our parameters after each iteration we can reduce the loss, which is our primary goal. According to Arthur Samuel, gradient descent is the automatic process of altering weights to maximize performance (Fast AI).

gradient descent, iteration, machine learning model, (7 more...)

Industry: Health & Medicine (0.33)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping

Kuzborskij, Ilja, Szepesvári, Csaba

arXiv.org Machine Learning, Aug-23-2021

We explore the ability of overparameterized shallow neural networks to learn Lipschitz regression functions with and without label noise when trained by Gradient Descent (GD). To avoid the problem that in the presence of noisy labels, neural networks trained to nearly zero training error are inconsistent on this class, we propose an early stopping rule that allows us to show optimal rates. This provides an alternative to the result of Hu et al. (2021) who studied the performance of $\ell_2$-regularized GD for training shallow networks in nonparametric regression which fully relied on the infinite-width network (Neural Tangent Kernel (NTK)) approximation. Here we present a simpler analysis which is based on a partitioning argument of the input space (as in the case of 1-nearest-neighbor rule) coupled with the fact that trained neural networks are smooth with respect to their inputs when trained by GD. In the noise-free case the proof does not rely on any kernelization and can be regarded as a finite-width result. In the case of label noise, by slightly modifying the proof, the noise is controlled using a technique of Yao, Rosasco, and Caponnetto (2007).

inequality, neural network, regression function, (16 more...)

2107.05341

Country:

North America > Canada > Alberta (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Rate distortion comparison of a few gradient quantizers

Adikari, Tharindu

arXiv.org Artificial Intelligence, Aug-22-2021

This article is in the context of gradient compression. Gradient compression is a popular technique for mitigating the communication bottleneck observed when training large machine learning models in a distributed manner using gradient-based methods such as stochastic gradient descent. In this article, assuming a Gaussian distribution for the components of the gradient, we find the rate distortion trade-off of gradient quantization schemes such as Scaled-sign and Top-K, and compare it with the Shannon rate distortion limit. A similar comparison with vector quantizers is also presented.
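For readers unfamiliar with the two quantizers named above, here is a minimal sketch of their commonly used forms, together with a mean squared distortion measure. The example gradient, the tie-breaking for zero components, and the choice of distortion measure are assumptions for illustration; the article's own definitions may differ in detail.

```python
def scaled_sign(grad):
    """Keep only the sign of each component, scaled by the mean absolute value."""
    scale = sum(abs(g) for g in grad) / len(grad)
    # Zero components are mapped to +scale here (an arbitrary choice).
    return [scale if g >= 0 else -scale for g in grad]

def top_k(grad, k):
    """Keep the k largest-magnitude components and zero out the rest."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]

def distortion(grad, quantized):
    """Mean squared error between a gradient and its quantized version."""
    return sum((g - q) ** 2 for g, q in zip(grad, quantized)) / len(grad)

grad = [0.5, -2.0, 0.1, 1.5]
sign_q = scaled_sign(grad)
topk_q = top_k(grad, 2)
print(sign_q, distortion(grad, sign_q))
print(topk_q, distortion(grad, topk_q))
```

Scaled-sign sends one bit per component plus a single scale, while Top-K sends a few exact values plus their indices, so the two schemes trade rate against distortion in different ways.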

distortion, quantizer, reconstruction point, (13 more...)

2108.09899

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > Canada > Quebec > Montreal (0.04)
North America > United States > California (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

How Can Increased Randomness in Stochastic Gradient Descent Improve Generalization?

Bradley, Arwen V., Gomez-Uribe, Carlos Alberto

arXiv.org Machine Learning, Aug-21-2021

Recent works report that increasing the learning rate or decreasing the minibatch size in stochastic gradient descent (SGD) can improve test set performance. We argue this is expected under some conditions in models with a loss function with multiple local minima. Our main contribution is an approximate but analytical approach inspired by methods in Physics to study the role of the SGD learning rate and batch size in generalization. We characterize test set performance under a shift between the training and test data distributions for loss functions with multiple minima. The shift can simply be due to sampling, and is therefore typically present in practical applications. We show that the resulting shift in local minima worsens test performance by picking up curvature, implying that generalization improves by selecting wide and/or little-shifted local minima. We then specialize to SGD, and study its test performance under stationarity. Because obtaining the exact stationary distribution of SGD is intractable, we derive a Fokker-Planck approximation of SGD and obtain its stationary distribution instead. This process shows that the learning rate divided by the minibatch size plays a role analogous to temperature in statistical mechanics, and implies that SGD, including its stationary distribution, is largely invariant to changes in learning rate or batch size that leave its temperature constant. We show that increasing SGD temperature encourages the selection of local minima with lower curvature, and can enable better generalization. We provide experiments on CIFAR10 demonstrating the temperature invariance of SGD, improvement of the test loss as SGD temperature increases, and quantifying the impact of sampling versus domain shift in driving this effect. Finally, we present synthetic experiments showing how our theory applies in a simplified loss with two local minima.
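The temperature analogy in the abstract above can be illustrated with a few lines of arithmetic: if temperature is taken to be the learning rate divided by the minibatch size, then scaling both by the same factor leaves it unchanged. The function names and the example configurations below are hypothetical.

```python
import math

def sgd_temperature(learning_rate, batch_size):
    """Learning rate divided by minibatch size, the temperature analogue."""
    return learning_rate / batch_size

def same_temperature(config_a, config_b, rel_tol=1e-9):
    """Check whether two (learning_rate, batch_size) pairs share a temperature."""
    return math.isclose(sgd_temperature(*config_a),
                        sgd_temperature(*config_b),
                        rel_tol=rel_tol)

baseline = (0.1, 128)
scaled_together = (0.2, 256)   # both doubled, so the temperature is unchanged
hotter = (0.1, 32)             # smaller batch, so the temperature is higher

print(same_temperature(baseline, scaled_together))
print(sgd_temperature(*hotter) / sgd_temperature(*baseline))
```

Under the abstract's claim, the first two configurations would be expected to behave similarly, while the third corresponds to a higher temperature.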

batch size, experiment, minima, (15 more...)

2108.09507

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Santa Clara County > Cupertino (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Fast Margin Maximization via Dual Acceleration

Ji, Ziwei, Srebro, Nathan, Telgarsky, Matus

arXiv.org Machine Learning, Aug-21-2021

We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of $\mathcal{O}(1/\log(t))$ for standard gradient descent, and $\mathcal{O}(1/t)$ for normalized gradient descent. This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual, which manages to result in a simple and intuitive method in the primal. This dual view can also be used to derive a stochastic variant, which performs adaptive non-uniform sampling via the dual variables.

algorithm 1, fast margin maximization, gradient descent, (12 more...)

2107.00595

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Illinois > Champaign County > Urbana (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)

FedSkel: Efficient Federated Learning on Heterogeneous Systems with Skeleton Gradients Update

Luo, Junyu, Yang, Jianlei, Ye, Xucheng, Guo, Xin, Zhao, Weisheng

arXiv.org Artificial Intelligence, Aug-20-2021

Federated learning aims to protect users' privacy while performing data analysis from different participants. However, it is challenging to guarantee the training efficiency on heterogeneous systems due to the various computational capabilities and communication bottlenecks. In this work, we propose FedSkel to enable computation-efficient and communication-efficient federated learning on edge devices by only updating the model's essential parts, named skeleton networks. FedSkel is evaluated on real edge devices with imbalanced datasets. Experimental results show that it achieves up to 5.52$\times$ speedups for CONV layers' back-propagation and 1.82$\times$ speedups for the whole training process, and reduces communication cost by 64.8%, with negligible accuracy loss.

fedskel, fl system, skeleton network, (12 more...)

doi: 10.1145/3459637.3482107

2108.09081

Country:

Oceania > Australia (0.06)
North America > United States > Virginia (0.05)
North America > United States > New York > New York County > New York City (0.04)
Asia > China (0.04)

Genre: Research Report (0.84)

Industry: Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.41)

On Accelerating Distributed Convex Optimizations

Chakrabarti, Kushal, Gupta, Nirupam, Chopra, Nikhil

arXiv.org Machine Learning, Aug-19-2021

This paper studies a distributed multi-agent convex optimization problem. The system comprises multiple agents in this problem, each with a set of local data points and an associated local cost function. The agents are connected to a server, and there is no inter-agent communication. The agents' goal is to learn a parameter vector that optimizes the aggregate of their local costs without revealing their local data points. In principle, the agents can solve this problem by collaborating with the server using the traditional distributed gradient-descent method. However, when the aggregate cost is ill-conditioned, the gradient-descent method (i) requires a large number of iterations to converge, and (ii) is highly unstable against process noise. We propose an iterative pre-conditioning technique to mitigate the deleterious effects of the cost function's conditioning on the convergence rate of distributed gradient-descent. Unlike the conventional pre-conditioning techniques, the pre-conditioner matrix in our proposed technique updates iteratively to facilitate implementation on the distributed network. In the distributed setting, we provably show that the proposed algorithm converges linearly with a better rate of convergence than the traditional and adaptive gradient-descent methods. Additionally, for the special case when the minimizer of the aggregate cost is unique, our algorithm converges superlinearly. We demonstrate our algorithm's superior performance compared to prominent distributed algorithms for solving real logistic regression problems and emulating neural network training via a noisy quadratic model, thereby signifying the proposed algorithm's efficiency for distributively solving non-convex optimization. Moreover, we empirically show that the proposed algorithm results in faster training without compromising the generalization performance.
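The effect of ill-conditioning mentioned above can be seen on a toy quadratic. The sketch below compares plain gradient descent with a fixed, exact diagonal pre-conditioner on a single machine. This is not the paper's iterative, distributed pre-conditioner: the cost function, the curvature values, the step sizes, and the stopping tolerance are all assumptions chosen to show why conditioning matters.

```python
def gd_iterations(curvatures, precondition=False, tol=1e-6, max_iters=100000):
    """Minimize 0.5 * sum(c_i * x_i^2) starting from x = (1, ..., 1).

    Returns the number of iterations until every coordinate is below tol.
    """
    x = [1.0] * len(curvatures)
    # Plain GD uses step 1/L, where L is the largest curvature.
    step = 1.0 if precondition else 1.0 / max(curvatures)
    for iteration in range(1, max_iters + 1):
        grad = [c * xi for c, xi in zip(curvatures, x)]
        if precondition:
            # Rescale each gradient coordinate by the inverse curvature.
            grad = [g / c for g, c in zip(grad, curvatures)]
        x = [xi - step * g for xi, g in zip(x, grad)]
        if max(abs(xi) for xi in x) < tol:
            return iteration
    return max_iters

curvatures = [1.0, 1000.0]  # condition number 1000
plain = gd_iterations(curvatures)
preconditioned = gd_iterations(curvatures, precondition=True)
print(plain, preconditioned)
```

With plain gradient descent the step size is limited by the largest curvature, so progress along the flat direction is slow. Rescaling by the inverse curvature removes that imbalance in this toy case.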

algorithm, iteration, noisy quadratic model, (13 more...)

2108.0867

Country:

North America > United States > Maryland > Prince George's County > College Park (0.14)
Europe > Switzerland > Vaud > Lausanne (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.96)
