AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Projected Stein Variational Gradient Descent

arXiv.org Machine LearningFeb-9-2020

The curse of dimensionality is a critical challenge in Bayesian inference for high dimensional parameters. In this work, we address this challenge by developing a projected Stein variational gradient descent (pSVGD) method, which projects the parameters into a subspace that is adaptively constructed using the gradient of the log-likelihood, and applies SVGD for the much lower-dimensional coefficients of the projection. We provide an upper bound for the projection error with respect to the posterior and demonstrate the accuracy (compared to SVGD) and scalability of pSVGD with respect to the number of parameters, samples, data points, and processor cores.

dimension, posterior, psvgd, (13 more...)

arXiv.org Machine Learning

2002.03469

Country: North America > United States > Texas > Travis County > Austin (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Add feedback

Momentum Improves Normalized SGD

Cutkosky, Ashok, Mehta, Harsh

arXiv.org Machine LearningFeb-9-2020

We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. Then, we consider the case of objectives with bounded second derivative and show that in this case a small tweak to the momentum formula allows normalized SGD with momentum to find an ɛ-critical point in O(1/ɛ 3.5) iterations, matching the best-known rates without accruing any logarithmic factors or dependence on dimension. We also provide an adaptive method that automatically improves convergence rates when the variance in the gradients is small. Finally, we show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining, matching the performance of the disparate methods used to get state-of-the-art results on both tasks. The rise of deep learning has focused research attention on the problem of solving optimization problems that are high-dimensional, large-scale, and non-convex. Modern neural networks can have billions of parameters (high-dimensional) Raffel et al. (2019); Shazeer et al. (2017), are trained using datasets containing millions of examples (large scale) Deng et al. (2009) on objective functions that are non-convex. Because of these considerations, stochastic gradient descent (SGD) has emerged as the de-facto method-of-choice for training deep models.

algorithm, momentum, optimization, (13 more...)

arXiv.org Machine Learning

2002.03305

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.55)

Add feedback

Extended Stochastic Gradient MCMC for Large-Scale Bayesian Variable Selection

Song, Qifan, Sun, Yan, Ye, Mao, Liang, Faming

arXiv.org Machine LearningFeb-7-2020

Stochastic gradient Markov chain Monte Carlo (MCMC) algorithms have received much attention in Bayesian computing for big data problems, but they are only applicable to a small class of problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. This paper proposes an extended stochastic gradient MCMC lgoriathm which, by introducing appropriate latent variables, can be applied to more general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. Numerical studies show that the proposed algorithm is highly scalable and much more efficient than traditional MCMC algorithms. The proposed algorithms have much alleviated the pain of Bayesian methods in big data computing.

algorithm, iteration, stochastic gradient langevin, (10 more...)

arXiv.org Machine Learning

2002.02919

Country:

North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
Europe > United Kingdom (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

Add feedback

Achieving the fundamental convergence-communication tradeoff with Differentially Quantized Gradient Descent

Lin, Chung-Yi, Kostina, Victoria, Hassibi, Babak

arXiv.org Machine LearningFeb-6-2020

The problem of reducing the communication cost in distributed training through gradient quantization is considered. For the class of smooth and strongly convex objective functions, we characterize the minimum achievable linear convergence rate for a given number of bits per problem dimension $n$. We propose Differentially Quantized Gradient Descent, a quantization algorithm with error compensation, and prove that it achieves the fundamental tradeoff between communication rate and convergence rate as $n$ goes to infinity. In contrast, the naive quantizer that compresses the current gradient directly fails to achieve that optimal tradeoff. Experimental results on both simulated and real-world least-squares problems confirm our theoretical analysis.

convergence rate, convergence-communication tradeoff, linear convergence rate, (11 more...)

arXiv.org Machine Learning

2002.02508

Country:

North America > United States > California > Los Angeles County > Long Beach (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
North America > United States > Nevada (0.04)
(7 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.72)

Add feedback

Faster On-Device Training Using New Federated Momentum Algorithm

Huo, Zhouyuan, Yang, Qian, Gu, Bin, Huang, Lawrence Carin. Heng

arXiv.org Machine LearningFeb-5-2020

Mobile crowdsensing has gained significant attention in recent years and has become a critical paradigm for emerging Internet of Things applications. The sensing devices continuously generate a significant quantity of data, which provide tremendous opportunities to develop innovative intelligent applications. To utilize these data to train machine learning models while not compromising user privacy, federated learning has become a promising solution. However, there is little understanding of whether federated learning algorithms are guaranteed to converge. We reconsider model averaging in federated learning and formulate it as a gradient-based method with biased gradients. This novel perspective assists analysis of its convergence rate and provides a new direction for more acceleration. We prove for the first time that the federated averaging algorithm is guaranteed to converge for non-convex problems, without imposing additional assumptions. We further propose a novel accelerated federated learning algorithm and provide a convergence guarantee. Simulated federated learning experiments are conducted to train deep neural networks on benchmark datasets, and experimental results show that our proposed method converges faster than previous approaches.

arxiv preprint arxiv, fedavg, non-convex problem, (13 more...)

arXiv.org Machine Learning

2002.0209

Country:

North America > United States > District of Columbia > Washington (0.05)
Europe > Russia (0.04)
Asia > Russia (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Entropy Minimization vs. Diversity Maximization for Domain Adaptation

Wu, Xiaofu, hang, Suofei, Zhou, Quan, Yang, Zhen, Zhao, Chunming, Latecki, Longin Jan

arXiv.org Machine LearningFeb-5-2020

Entropy minimization has been widely used in unsupervised domain adaptation (UDA). However, existing works reveal that entropy minimization only may result into collapsed trivial solutions. In this paper, we propose to avoid trivial solutions by further introducing diversity maximization. In order to achieve the possible minimum target risk for UDA, we show that diversity maximization should be elaborately balanced with entropy minimization, the degree of which can be finely controlled with the use of deep embedded validation in an unsupervised manner. The proposed minimal-entropy diversity maximization (MEDM) can be directly implemented by stochastic gradient descent without use of adversarial learning. Empirical evidence demonstrates that MEDM outperforms the state-of-the-art methods on four popular domain adaptation datasets.

adaptation, domain adaptation, minimization, (13 more...)

arXiv.org Machine Learning

2002.0169

Country:

Asia > China > Jiangsu Province > Nanjing (0.05)
Asia > Middle East > Jordan (0.05)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Add feedback

Minimax Defense against Gradient-based Adversarial Attacks

Lindqvist, Blerta, Izmailov, Rauf

arXiv.org Machine LearningFeb-4-2020

State-of-the-art adversarial attacks are aimed at neural network classifiers. By default, neural networks use gradient descent to minimize their loss function. The gradient of a classifier's loss function is used by gradient-based adversarial attacks to generate adversarially perturbed images. We pose the question whether another type of optimization could give neural network classifiers an edge. Here, we introduce a novel approach that uses minimax optimization to foil gradient-based adversarial attacks. Our minimax classifier is the discriminator of a generative adversarial network (GAN) that plays a minimax game with the GAN generator. In addition, our GAN generator projects all points onto a manifold that is different from the original manifold since the original manifold might be the cause of adversarial attacks. To measure the performance of our minimax defense, we use adversarial attacks - Carlini Wagner (CW), DeepFool, Fast Gradient Sign Method (FGSM) - on three datasets: MNIST, CIFAR-10 and German Traffic Sign (TRAFFIC). Against CW attacks, our minimax defense achieves 98.07% (MNIST-default 98.93%), 73.90% (CIFAR-10-default 83.14%) and 94.54% (TRAFFIC-default 96.97%). Against DeepFool attacks, our minimax defense achieves 98.87% (MNIST), 76.61% (CIFAR-10) and 94.57% (TRAFFIC). Against FGSM attacks, we achieve 97.01% (MNIST), 76.79% (CIFAR-10) and 81.41% (TRAFFIC). Our Minimax adversarial approach presents a significant shift in defense strategy for neural network classifiers.

adversarial attack, classifier, dataset, (15 more...)

arXiv.org Machine Learning

2002.01256

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.47)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Add feedback

Newton-ADMM: A Distributed GPU-Accelerated Optimizer for Multiclass Classification Problems

Fang, Chih-Hao, Kylasa, Sudhir B, Roosta, Fred, Mahoney, Michael W., Grama, Ananth

arXiv.org Machine LearningFeb-4-2020

First-order optimization methods, such as stochastic gradient descent (SGD) and its variants, are widely used in machine learning applications due to their simplicity and low per-iteration costs. However, they often require larger numbers of iterations, with associated communication costs in distributed environments. In contrast, Newton-type methods, while having higher per-iteration costs, typically require a significantly smaller number of iterations, which directly translates to reduced communication costs. In this paper, we present a novel distributed optimizer for classification problems, which integrates a GPU-accelerated Newton-type solver with the global consensus formulation of Alternating Direction of Method Multipliers (ADMM). By leveraging the communication efficiency of ADMM, GPU-accelerated inexact-Newton solver, and an effective spectral penalty parameter selection strategy, we show that our proposed method (i) yields better generalization performance on several classification problems; (ii) significantly outperforms state-of-the-art methods in distributed time to solution; and (iii) offers better scaling on large distributed platforms.

dataset, newton-admm, solver, (15 more...)

arXiv.org Machine Learning

1807.07132

Country:

North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
Oceania > Australia > Queensland (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

Oracle lower bounds for stochastic gradient sampling algorithms

Chatterji, Niladri S., Bartlett, Peter L., Long, Philip M.

arXiv.org Machine LearningFeb-1-2020

We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^{d}$, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed. Several popular sampling algorithms (including many Markov chain Monte Carlo methods) operate by using stochastic gradients of the log density to generate a sample; our results establish an information theoretic limit for all these algorithms. We show that for every algorithm, there exists a well-conditioned strongly log-concave target density for which the distribution of points generated by the algorithm would be at least $\varepsilon$ away from the target in total variation distance if the number of gradient queries is less than $\Omega(\sigma^2 d/\varepsilon^2)$, where $\sigma^2 d$ is the variance of the stochastic gradient. Our lower bound follows by combining the ideas of Le Cam deficiency routinely used in the comparison of statistical experiments along with standard information theoretic tools used in lower bounding Bayes risk functions. To the best of our knowledge our results provide the first nontrivial dimension-dependent lower bound for this problem.

algorithm, gradient, iteration complexity, (13 more...)

arXiv.org Machine Learning

2002.00291

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Linear regression with gradient descent in R.

#artificialintelligenceJan-31-2020, 21:59:13 GMT

This demonstrates a basic machine learning linear regression. In the outputs, compare the values for intercept and slope from the built-in R lm() method with those that we calculate manually with gradient descent. The plots show how close the red and blue lines overlap.

gradient descent, linear regression, regression

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.92)

Add feedback