AITopics

1910.12837

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California > Orange County > Irvine (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

arXiv.org Machine Learning Oct-28-2019

On the Global Convergence of (Fast) Incremental Expectation Maximization Methods

Karimi, Belhal, Wai, Hoi-To, Moulines, Eric, Lavielle, Marc

The EM algorithm is one of the most popular algorithms for inference in latent data models. The original formulation of the EM algorithm does not scale to large data sets, because the whole data set is required at each iteration of the algorithm. To alleviate this problem, Neal and Hinton proposed an incremental version of EM (iEM) in which, at each iteration, the conditional expectation of the latent data (E-step) is updated only for a mini-batch of observations. Another approach was proposed by Cappé and Moulines, in which the E-step is replaced by a stochastic approximation step closely related to stochastic gradient. In this paper, we analyze incremental and stochastic versions of the EM algorithm, as well as the variance-reduced version of Chen et al., in a common unifying framework. We also introduce a new incremental version, inspired by the SAGA algorithm of Defazio et al. We establish non-asymptotic convergence bounds for global convergence. Numerical applications are presented in this article to illustrate our findings.
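
The incremental E-step described in the abstract can be illustrated with a minimal sketch. It assumes a one-dimensional Gaussian mixture with unit variances, and the function names (`responsibilities`, `incremental_em`) and default parameters are hypothetical choices for illustration, not taken from the paper. Only the mini-batch's responsibilities are refreshed at each iteration; the aggregate sufficient statistics are corrected by the difference between new and old values before the M-step.

```python
import math
import random


def responsibilities(x, weights, means):
    """E-step for one observation: posterior probability of each component.

    Assumes unit-variance Gaussian components (an illustrative simplification).
    """
    dens = [w * math.exp(-0.5 * (x - m) ** 2) for w, m in zip(weights, means)]
    total = sum(dens)
    return [d / total for d in dens]


def incremental_em(data, means, weights, n_iters=200, batch_size=20, seed=0):
    """Incremental EM sketch: refresh the E-step on a mini-batch only."""
    rng = random.Random(seed)
    n, k = len(data), len(means)
    # Per-sample responsibilities, initialised with one full E-step.
    resp = [responsibilities(x, weights, means) for x in data]
    # Aggregate sufficient statistics: s0[j] = sum_i r_ij, s1[j] = sum_i r_ij * x_i.
    s0 = [sum(r[j] for r in resp) for j in range(k)]
    s1 = [sum(r[j] * x for r, x in zip(resp, data)) for j in range(k)]
    for _ in range(n_iters):
        for i in rng.sample(range(n), batch_size):
            new_r = responsibilities(data[i], weights, means)
            for j in range(k):
                # Swap the old contribution of sample i for the new one.
                s0[j] += new_r[j] - resp[i][j]
                s1[j] += (new_r[j] - resp[i][j]) * data[i]
            resp[i] = new_r
        # M-step from the aggregate statistics.
        weights = [s0[j] / n for j in range(k)]
        means = [s1[j] / s0[j] for j in range(k)]
    return weights, means
```

Each iteration touches only `batch_size` observations, which is the property that lets this style of update scale to data sets where a full E-step per iteration is too expensive.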

algorithm, convergence, statistics, (16 more...)

1910.12521

Country:

Europe > France (0.04)
Asia > China > Hong Kong (0.04)
Oceania > New Zealand > North Island > Waikato (0.04)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

arXiv.org Machine Learning Oct-28-2019

Splitting Steepest Descent for Growing Neural Architectures

Liu, Qiang, Wu, Lemeng, Wang, Dilin

We develop a progressive training approach for neural networks which adaptively grows the network structure by splitting existing neurons into multiple offspring. By leveraging a functional steepest descent idea, we derive a simple criterion for deciding the best subset of neurons to split and a splitting gradient for optimally updating the offspring. Theoretically, our splitting strategy is a second-order functional steepest descent for escaping saddle points in an $\infty$-Wasserstein metric space, on which the standard parametric gradient descent is a first-order steepest descent. Our method provides a new computationally efficient approach for optimizing neural network structures, especially for learning lightweight neural architectures in resource-constrained settings.

neural network, neuron, splitting, (14 more...)

1910.02366

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks

Arora, Sanjeev, Du, Simon S., Li, Zhiyuan, Salakhutdinov, Ruslan, Wang, Ruosong, Yu, Dingli

Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under $\ell_2$ loss by gradient descent with infinitesimally small learning rate, and (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its convolutional counterparts, appears in Arora et al. (2019a), which allowed studying the performance of infinitely wide nets on datasets like CIFAR-10. However, the super-quadratic running time of kernel methods makes them best suited for small-data tasks. We report results suggesting neural tangent kernels perform strongly on low-data tasks. 1. On a standard testbed of classification/regression tasks from the UCI database, NTK SVM beats the previous gold standard, Random Forests (RF), and also the corresponding finite nets. 2. On CIFAR-10 with 10-640 training samples, Convolutional NTK consistently beats ResNet-34 by 1%-3%. 3. On the VOC07 testbed for few-shot image classification tasks on ImageNet with transfer learning (Goyal et al., 2019), replacing the linear SVM currently used with a Convolutional NTK SVM consistently improves performance. 4. Comparing the performance of NTK with the finite-width net it was derived from, NTK behavior starts at lower net widths than suggested by theoretical analysis (Arora et al., 2019a). NTK's efficacy may trace to lower variance of output.

artificial intelligence, dataset, machine learning, (17 more...)

1910.01663

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Puerto Rico > San Juan > San Juan (0.04)
Europe > Switzerland (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

PopSGD: Decentralized Stochastic Gradient Descent in the Population Model

Nadiradze, Giorgi, Sabour, Amirmojtaba, Sharma, Aditya, Markov, Ilia, Aksenov, Vitaly, Alistarh, Dan

The population model is a standard way to represent large-scale decentralized distributed systems, in which agents with limited computational power interact in randomly chosen pairs, in order to collectively solve global computational tasks. In contrast with synchronous gossip models, nodes are anonymous, lack a common notion of time, and have no control over their scheduling. In this paper, we examine whether large-scale distributed optimization can be performed in this extremely restrictive setting. We introduce and analyze a natural decentralized variant of stochastic gradient descent (SGD), called PopSGD, in which every node maintains a local parameter, and is able to compute stochastic gradients with respect to this parameter. Every pair-wise node interaction performs a stochastic gradient step at each agent, followed by averaging of the two models. We prove that, under standard assumptions, SGD can converge even in this extremely loose, decentralized setting, for both convex and non-convex objectives. Moreover, surprisingly, in the former case, the algorithm can achieve linear speedup in the number of nodes $n$. Our analysis leverages a new technical connection between decentralized SGD and randomized load-balancing, which enables us to tightly bound the concentration of node parameters. We validate our analysis through experiments, showing that PopSGD can achieve convergence and speedup for large-scale distributed learning tasks in a supercomputing environment.
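
The pairwise interaction described in the abstract (a stochastic gradient step at each of the two agents, followed by averaging of their models) can be sketched as follows. This is a toy illustration under stated assumptions: a scalar parameter, a quadratic objective 0.5 * (x - target)^2 with Gaussian gradient noise, and a constant step size. The function name `pop_sgd` and all defaults are hypothetical and are not taken from the paper.

```python
import random


def pop_sgd(n_nodes=20, n_interactions=5000, lr=0.05, target=3.0, noise=0.1, seed=0):
    """Toy population-model SGD on the objective 0.5 * (x - target)^2."""
    rng = random.Random(seed)
    # Every node keeps its own local parameter.
    params = [rng.uniform(-5.0, 5.0) for _ in range(n_nodes)]

    def stochastic_grad(x):
        # Exact gradient plus Gaussian noise (assumed noise model).
        return (x - target) + rng.gauss(0.0, noise)

    for _ in range(n_interactions):
        # The scheduler picks a uniformly random pair of distinct nodes.
        i, j = rng.sample(range(n_nodes), 2)
        # Each of the two nodes takes a local stochastic gradient step ...
        xi = params[i] - lr * stochastic_grad(params[i])
        xj = params[j] - lr * stochastic_grad(params[j])
        # ... and then both adopt the average of the two models.
        params[i] = params[j] = 0.5 * (xi + xj)
    return params
```

Nodes never consult a global clock or a coordinator here; the only communication is the pairwise average, which is what makes the setting so restrictive.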

interaction, lemma 4, node, (14 more...)

1910.12308

Country:

Europe > Austria (0.04)
North America > United States > Massachusetts (0.04)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report > New Finding (0.92)

Industry: Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

A geometric interpretation of stochastic gradient descent using diffusion metrics

Fioresi, R., Chaudhari, P., Soatto, S.

Stochastic gradient descent (SGD) is a key ingredient in the training of deep neural networks and yet its geometrical significance appears elusive. We study a deterministic model in which the trajectories of our dynamical systems are described via geodesics of a family of metrics arising from the diffusion matrix. These metrics encode information about the highly non-isotropic gradient noise in SGD. We establish a parallel with General Relativity models, where the role of the electromagnetic field is played by the gradient of the loss function. We compute an example of a two-layer network.

diffusion metric, equation, gradient descent, (13 more...)

1910.12194

Country:

Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.05)
North America > United States > Pennsylvania (0.04)
North America > United States > New York (0.04)
North America > United States > California (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Communication Efficient Decentralized Training with Multiple Local Updates

Li, Xiang, Yang, Wenhao, Wang, Shusen, Zhang, Zhihua

Decentralized optimization has been demonstrated to be very useful in machine learning. This work studies the communication-efficiency issue in decentralized optimization. We analyze the Periodic Decentralized Stochastic Gradient Descent (PD-SGD) algorithm, a straightforward combination of federated averaging and decentralized SGD. In the setting of non-convex objectives and non-identically distributed data, we prove that PD-SGD converges to a critical point. In particular, the number of local SGD steps trades off communication and local computation. From an algorithmic perspective, we analyze a novel version of PD-SGD, which alternates between multiple local updates and multiple decentralized SGD steps. We also show that when we periodically shrink the length of local updates, this generalized PD-SGD can better balance the communication-convergence trade-off both theoretically and empirically.
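
The alternation between local updates and decentralized communication can be illustrated with a rough sketch. The assumptions are loud ones: a ring topology with uniform 1/3 mixing weights, scalar parameters, and node i holding its own quadratic objective 0.5 * (x - targets[i])^2 to mimic non-identically distributed data. The name `pd_sgd` and every default are hypothetical; the paper's exact update rule and mixing matrix may differ.

```python
import random


def pd_sgd(targets, n_rounds=200, local_steps=4, lr=0.05, noise=0.1, seed=0):
    """Periodic decentralized SGD sketch on a ring of len(targets) nodes."""
    rng = random.Random(seed)
    n = len(targets)
    params = [0.0] * n
    for _ in range(n_rounds):
        # Phase 1: several local SGD steps with no communication.
        for _ in range(local_steps):
            params = [
                x - lr * ((x - c) + rng.gauss(0.0, noise))
                for x, c in zip(params, targets)
            ]
        # Phase 2: one communication step, averaging with the two ring neighbours.
        # params[i - 1] wraps around for i == 0 via negative indexing.
        params = [
            (params[i - 1] + params[i] + params[(i + 1) % n]) / 3.0
            for i in range(n)
        ]
    return params
```

Raising `local_steps` reduces how often nodes communicate per gradient evaluation, at the cost of letting local parameters drift toward their own targets between averaging steps; this is the trade-off the abstract refers to.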

optimization, pd-sgd, residual error, (13 more...)

1910.09126

Country:

North America > United States > Virginia (0.04)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

#artificialintelligence Oct-26-2019, 01:06:39 GMT

Path Length Bounds for Gradient Descent

Figure 1: A two-dimensional convex function represented via contour lines. The function value is constant on the boundary of each such ellipse, and decreases as the ellipse becomes smaller and smaller. Let us assume we want to minimize this function starting from a point $A$. The red line shows the path followed by a gradient descent optimizer converging to the minimum point $B$, while the green dashed line represents the direct line joining $A$ and $B$. In today's post, we will discuss an interesting property concerning the trajectory of gradient descent iterates, namely the length of the Gradient Descent curve.
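
To make the comparison between the red path and the green dashed line concrete, here is a small sketch that measures the length of the gradient descent trajectory on an elliptical quadratic. The objective 0.5 * (a * x^2 + b * y^2), the step size, and the function name `gd_path_length` are assumptions chosen for illustration; they are not taken from the post.

```python
import math


def gd_path_length(start, a=1.0, b=10.0, lr=0.09, n_steps=500):
    """Run gradient descent on 0.5 * (a*x^2 + b*y^2) and sum the step lengths."""
    x, y = start
    length = 0.0
    for _ in range(n_steps):
        gx, gy = a * x, b * y              # gradient of the quadratic
        nx, ny = x - lr * gx, y - lr * gy  # one gradient descent step
        length += math.hypot(nx - x, ny - y)
        x, y = nx, ny
    return length, (x, y)


if __name__ == "__main__":
    path, end = gd_path_length((4.0, 2.0))
    straight = math.hypot(4.0, 2.0)  # distance from the start A to the minimizer B = (0, 0)
    print(f"path length = {path:.4f}, straight line = {straight:.4f}")
```

By the triangle inequality the path length can never be shorter than the straight-line distance between the first and last iterate; how much longer it can get is the kind of question a path length bound answers.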

mathbf, path length, sqrt, (16 more...)

#artificialintelligence

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

#artificialintelligence Oct-26-2019, 01:06:29 GMT

Path Length Bounds for Gradient Descent

artificial intelligence, machine learning, optimization

#artificialintelligence

Industry: Media > News (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

arXiv.org Machine Learning Oct-26-2019

Improved Zeroth-Order Variance Reduced Algorithms and Analysis for Nonconvex Optimization

Ji, Kaiyi, Wang, Zhe, Zhou, Yi, Liang, Yingbin

Two types of zeroth-order stochastic algorithms have recently been designed for nonconvex optimization, based respectively on the first-order techniques SVRG and SARAH/SPIDER. This paper addresses several important issues that are still open in these methods. First, all existing SVRG-type zeroth-order algorithms suffer from worse function query complexities than either zeroth-order gradient descent (ZO-GD) or stochastic gradient descent (ZO-SGD). In this paper, we propose a new algorithm ZO-SVRG-Coord-Rand and develop a new analysis for an existing ZO-SVRG-Coord algorithm proposed in Liu et al. (2018b), and show that both ZO-SVRG-Coord-Rand and ZO-SVRG-Coord (under our new analysis) outperform other existing SVRG-type zeroth-order methods as well as ZO-GD and ZO-SGD. Second, the existing SPIDER-type algorithm SPIDER-SZO (Fang et al., 2018) has superior theoretical performance, but suffers from the generation of a large number of Gaussian random variables as well as a $\sqrt{\epsilon}$-level stepsize in practice. In this paper, we develop a new algorithm ZO-SPIDER-Coord, which is free from Gaussian variable generation and allows a large constant stepsize while maintaining the same convergence rate and query complexity, and we further show that ZO-SPIDER-Coord automatically achieves a linear convergence rate as the iterate enters a local PL region, without restart or algorithmic modification.
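
For readers unfamiliar with zeroth-order methods, the sketch below shows a deterministic coordinate-wise gradient estimator built from function queries only, plugged into plain zeroth-order gradient descent. It assumes a central-difference form with smoothing parameter `mu`; it is not the variance-reduced ZO-SVRG-Coord-Rand or ZO-SPIDER-Coord algorithm from the paper, and the function names and defaults are hypothetical.

```python
def coord_grad_estimate(f, x, mu=1e-4):
    """Estimate the gradient of f at x using 2 * len(x) function queries."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += mu
        x_minus[i] -= mu
        # Central difference along coordinate i; no random directions needed.
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * mu))
    return grad


def zo_gradient_descent(f, x0, lr=0.1, n_steps=200, mu=1e-4):
    """Gradient descent that only ever evaluates f, never its derivative."""
    x = list(x0)
    for _ in range(n_steps):
        g = coord_grad_estimate(f, x, mu)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x
```

A coordinate-wise estimator of this kind costs a number of queries proportional to the dimension per gradient estimate, but it avoids sampling Gaussian directions, which is the cost the abstract attributes to SPIDER-SZO.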

variance reduced algorithm and analysis, optimization, qk 0, (6 more...)

1910.12166

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Ohio (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)