AITopics

Training deep neural networks usually requires tremendous computing resources. Therefore, many deep models are trained on large clusters instead of a single machine or GPU. Though most current research tries to run the whole model on all machines using asynchronous stochastic gradient descent (ASGD)[9], we present a new approach to training deep models in parallel: split the model and then train its different parts separately at different speeds.

artificial intelligence, machine learning, slowgear worker, (14 more...)

1806.03925

Country: North America > Canada > Ontario > Toronto (0.05)

Genre: Research Report (0.43)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Karimi, Hamed, Nutini, Julie, Schmidt, Mark

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition

In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the Łojasiewicz inequality proposed in the same year, and it does not require strong convexity (or even convexity). In this work, we show that this much-older Polyak-Łojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years. We also use the PL inequality to give new analyses of randomized and greedy coordinate descent methods, sign-based gradient descent methods, and stochastic gradient methods in the classic setting (with decreasing or constant step-sizes) as well as the variance-reduced setting. We further propose a generalization that applies to proximal-gradient methods for non-smooth optimization, leading to simple proofs of linear convergence of these methods. Along the way, we give simple convergence results for a wide variety of problems in machine learning: least squares, logistic regression, boosting, resilient backpropagation, L1-regularization, support vector machines, stochastic dual coordinate ascent, and stochastic variance-reduced gradient methods.
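
As a rough illustration of the claim that linear convergence does not need strong convexity, the sketch below runs gradient descent on a small quadratic that is flat along one coordinate. The function, the constants `L_SMOOTH` and `MU`, and the starting point are assumptions chosen for this example and do not come from the paper. The objective satisfies the PL inequality (1/2)·||∇f(x)||² ≥ μ·(f(x) − f*) with μ = 1 and has an L-Lipschitz gradient with L = 4, so with step size 1/L the gap f(x_k) − f* should shrink at least as fast as (1 − μ/L)^k.

```python
# Illustrative objective (an assumption for this sketch, not from the paper):
# f(x) = 0.5 * ((x0 - 1)^2 + 4 * (x1 + 1)^2), with no dependence on x2.
# It is not strongly convex (flat along x2) but satisfies the PL inequality.

L_SMOOTH = 4.0  # Lipschitz constant of the gradient (largest Hessian eigenvalue)
MU = 1.0        # PL constant (smallest nonzero Hessian eigenvalue)
F_STAR = 0.0    # optimal value of f


def f(x):
    return 0.5 * ((x[0] - 1.0) ** 2 + 4.0 * (x[1] + 1.0) ** 2)


def grad(x):
    return [x[0] - 1.0, 4.0 * (x[1] + 1.0), 0.0]


def gradient_descent(x0, steps):
    """Gradient descent with constant step size 1/L; returns final point and gaps."""
    x = list(x0)
    gaps = [f(x) - F_STAR]
    for _ in range(steps):
        g = grad(x)
        x = [xi - gi / L_SMOOTH for xi, gi in zip(x, g)]
        gaps.append(f(x) - F_STAR)
    return x, gaps


x_final, gaps = gradient_descent([5.0, 3.0, -2.0], 30)

# Check the linear rate f(x_k) - f* <= (1 - mu/L)^k * (f(x_0) - f*).
rate = 1.0 - MU / L_SMOOTH
bound_holds = all(gaps[k] <= rate ** k * gaps[0] + 1e-12 for k in range(len(gaps)))
```

The flat coordinate `x2` never moves, which is why the minimizer is not unique and strong convexity fails, yet the optimality gap still contracts geometrically.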

artificial intelligence, inequality, machine learning, (16 more...)

1608.04636

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Netherlands > South Holland > Dordrecht (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.55)

Aghazadeh, Amirali, Spring, Ryan, LeJeune, Daniel, Dasarathy, Gautam, Shrivastava, Anshumali, Baraniuk, Richard G.

MISSION: Ultra Large-Scale Feature Selection using Count-Sketches

Feature selection is an important challenge in machine learning. It plays a crucial role in the explainability of machine-driven decisions that are rapidly permeating modern society. Unfortunately, the explosion in the size and dimensionality of real-world datasets poses a severe challenge to standard feature selection algorithms. Today, it is not uncommon for datasets to have billions of dimensions. At such scale, even storing the feature vector is impossible, causing most existing feature selection methods to fail. Workarounds like feature hashing, a standard approach to large-scale machine learning, help with computational feasibility, but at the cost of losing the interpretability of features. In this paper, we present MISSION, a novel framework for ultra large-scale feature selection that performs stochastic gradient descent while maintaining an efficient representation of the features in memory using a Count-Sketch data structure. MISSION retains the simplicity of feature hashing without sacrificing the interpretability of the features while using only O(log^2(p)) working memory. We demonstrate that MISSION accurately and efficiently performs feature selection on real-world, large-scale datasets with billions of dimensions.
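
For readers unfamiliar with the Count-Sketch data structure mentioned above, the following is a minimal, generic sketch of it, not the authors' implementation. The class name `CountSketch`, the hash family, and the parameters `depth` and `width` are assumptions made for this illustration. Each integer key is hashed to one bucket per row with a random sign, and a query returns the median of the signed bucket values, so values can be accumulated per feature index without storing a full-length vector.

```python
import random
import statistics


class CountSketch:
    """Generic Count-Sketch over integer keys (illustrative, hypothetical API)."""

    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.prime = 2_147_483_647  # Mersenne prime 2^31 - 1 for modular hashing
        self.depth = depth
        self.width = width
        self.table = [[0.0] * width for _ in range(depth)]
        # One (a, b, c, d) tuple per row: (a, b) picks the bucket, (c, d) the sign.
        self.params = [
            tuple(rng.randrange(1, self.prime) for _ in range(4))
            for _ in range(depth)
        ]

    def _bucket_and_sign(self, row, key):
        a, b, c, d = self.params[row]
        bucket = ((a * key + b) % self.prime) % self.width
        sign = 1.0 if ((c * key + d) % self.prime) % 2 == 0 else -1.0
        return bucket, sign

    def update(self, key, value):
        """Add `value` to the running total stored for `key`."""
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, key)
            self.table[row][bucket] += sign * value

    def query(self, key):
        """Estimate the running total for `key` as a median over rows."""
        estimates = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, key)
            estimates.append(sign * self.table[row][bucket])
        return statistics.median(estimates)


sketch = CountSketch(depth=5, width=256)
sketch.update(42, 3.0)    # e.g. a gradient contribution for feature index 42
sketch.update(42, -1.0)
estimate = sketch.query(42)  # exact here because no other key has been inserted
```

In an SGD setting one could, under these assumptions, call `update` with per-feature gradient steps and `query` to read back approximate weights; with many keys the estimates become approximate because of bucket collisions.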

algorithm, artificial intelligence, machine learning, (13 more...)

1806.0431

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Texas > Harris County > Houston (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

George, Thomas, Laurent, César, Bouthillier, Xavier, Ballas, Nicolas, Vincent, Pascal

Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis

For models with many parameters, the covariance matrix that natural gradient methods are based on becomes gigantic, making these methods inapplicable in their original form. This has motivated research into both simple diagonal approximations and more sophisticated factored approximations such as KFAC (Heskes, 2000; Martens & Grosse, 2015; Grosse & Martens, 2016). In the present work we draw inspiration from both to propose a novel approximation that is provably better than KFAC and amenable to cheap partial updates. It consists of tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective. Experiments show improvements over KFAC in optimization speed for several deep network architectures.

artificial intelligence, freq, machine learning, (17 more...)

1806.03884

Country:

North America > Canada > Quebec > Montreal (0.04)
Oceania > Tonga (0.04)
North America > United States > Indiana > Hamilton County > Fishers (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Mescheder, Lars, Geiger, Andreas, Nowozin, Sebastian

Which Training Methods for GANs do actually Converge?

arXiv.org Artificial Intelligence, Jun-11-2018

Recent work has shown local convergence of GAN training for absolutely continuous data and generator distributions. In this paper, we show that the requirement of absolute continuity is necessary: we describe a simple yet prototypical counterexample showing that in the more realistic case of distributions that are not absolutely continuous, unregularized GAN training is not always convergent. Furthermore, we discuss regularization strategies that were recently proposed to stabilize GAN training. Our analysis shows that GAN training with instance noise or zero-centered gradient penalties converges. On the other hand, we show that Wasserstein-GANs and WGAN-GP with a finite number of discriminator updates per generator update do not always converge to the equilibrium point. We discuss these results, leading us to a new explanation for the stability problems of GAN training. Based on our analysis, we extend our convergence results to more general GANs and prove local convergence for simplified gradient penalties even if the generator and data distribution lie on lower dimensional manifolds. We find these penalties to work well in practice and use them to learn high-resolution generative image models for a variety of datasets with little hyperparameter tuning.

artificial intelligence, discriminator, machine learning, (15 more...)

1801.04406

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(6 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.30)

Chen, Lingjiao, Wang, Hongyi, Zhao, Jinman, Papailiopoulos, Dimitris, Koutris, Paraschos

The Effect of Network Width on the Performance of Large-batch Training

arXiv.org Machine Learning, Jun-10-2018

Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that--for a fixed number of parameters--wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.

artificial intelligence, machine learning, neural network, (17 more...)

1806.03791

Country: North America > United States > Wisconsin > Dane County > Madison (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)

arXiv.org Machine Learning, Jun-8-2018

Adversarial Meta-Learning

Yin, Chengxiang, Tang, Jian, Xu, Zhiyuan, Wang, Yanzhi

Meta-learning enables a model to learn from very limited data to undertake a new task. In this paper, we study general meta-learning with adversarial samples. We present a meta-learning algorithm, ADML (ADversarial Meta-Learner), which leverages clean and adversarial samples to optimize the initialization of a learning model in an adversarial manner. ADML has the following desirable properties: 1) it turns out to be very effective even in cases with only clean samples; 2) it is model-agnostic, i.e., it is compatible with any learning model that can be trained with gradient descent; and most importantly, 3) it is robust to adversarial samples, i.e., unlike other meta-learning methods, it suffers only a minor performance degradation when there are adversarial samples. We show via extensive experiments that ADML delivers state-of-the-art performance on two widely-used image datasets, MiniImageNet and CIFAR100, in terms of both accuracy and robustness.

adversarial sample, artificial intelligence, machine learning, (17 more...)

1806.03316

Country: Asia (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.38)

Weinshall, Daphna, Cohen, Gad, Amir, Dan

Curriculum Learning by Transfer Learning: Theory and Experiments with Deep Networks

arXiv.org Artificial Intelligence, Jun-8-2018

We provide a theoretical investigation of curriculum learning in the context of stochastic gradient descent when optimizing the convex linear regression loss. We prove that the rate of convergence of an ideal curriculum learning method is monotonically increasing with the difficulty of the examples. Moreover, among all equally difficult points, convergence is faster when using points which incur higher loss with respect to the current hypothesis. We then analyze curriculum learning in the context of training a CNN. We describe a method which infers the curriculum by way of transfer learning from another network, pre-trained on a different task. While this approach can only approximate the ideal curriculum, we observe empirically similar behavior to the one predicted by the theory, namely, a significant boost in convergence speed at the beginning of training. When the task is made more difficult, improvement in generalization performance is also observed. Finally, curriculum learning exhibits robustness against unfavorable conditions such as excessive regularization.

curriculum, curriculum learning, learning, (15 more...)

1802.03796

Country:

Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
North America > United States > New Jersey (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

Palacci, Henri, Hess, Henry

Scalable Natural Gradient Langevin Dynamics in Practice

arXiv.org Machine Learning, Jun-7-2018

Stochastic Gradient Langevin Dynamics (SGLD) is a sampling scheme for Bayesian modeling adapted to large datasets and models. SGLD relies on the injection of Gaussian noise at each step of a Stochastic Gradient Descent (SGD) update. In this scheme, every component in the noise vector is independent and has the same scale, whereas the parameters we seek to estimate exhibit strong variations in scale and significant correlation structures, leading to poor convergence and mixing times. We compare different preconditioning approaches to the normalization of the noise vector and benchmark these approaches on the following criteria: 1) mixing times of the multivariate parameter vector, 2) regularizing effect on small datasets where it is easy to overfit, 3) covariate shift detection and 4) resistance to adversarial examples.
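
To make the baseline scheme concrete, here is a minimal sketch of plain (unpreconditioned) SGLD on a toy problem: sampling the posterior over the mean of a unit-variance Gaussian. The model, the function name `sgld_gaussian_mean`, the prior variance, and all hyperparameters are assumptions chosen for this illustration and are not taken from the paper. Each step is a minibatch gradient step on the log posterior with step size ε/2, plus Gaussian noise of variance ε.

```python
import math
import random


def sgld_gaussian_mean(data, step_size, batch_size, num_steps,
                       prior_var=100.0, seed=0):
    """Plain SGLD for theta in the model x ~ N(theta, 1), theta ~ N(0, prior_var)."""
    rng = random.Random(seed)
    n_total = len(data)
    theta = 0.0
    samples = []
    for _ in range(num_steps):
        batch = rng.sample(data, batch_size)
        # Gradient of the log prior and rescaled minibatch log-likelihood gradient.
        grad_prior = -theta / prior_var
        grad_lik = (n_total / batch_size) * sum(x - theta for x in batch)
        # Injected noise: the same scale sqrt(step_size) for every parameter.
        noise = rng.gauss(0.0, math.sqrt(step_size))
        theta += 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples.append(theta)
    return samples


# Synthetic data with true mean 2.0 (an arbitrary choice for the demo).
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]

samples = sgld_gaussian_mean(data, step_size=1e-4, batch_size=50, num_steps=3000)
kept = samples[1000:]  # discard the first 1000 steps as burn-in
posterior_mean_estimate = sum(kept) / len(kept)
```

With a single parameter the shared noise scale is harmless; the mismatch the abstract describes arises when a parameter vector has components of very different scales or strong correlations, which is what preconditioning is meant to address.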

artificial intelligence, bayesian inference, machine learning, (17 more...)

1806.02855

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.95)

arXiv.org Machine Learning, Jun-7-2018

Stein Variational Gradient Descent Without Gradient

Han, Jun, Liu, Qiang

Stein variational gradient descent (SVGD) has been shown to be a powerful approximate inference algorithm for complex distributions. However, the standard SVGD requires calculating the gradient of the target density and cannot be applied when the gradient is unavailable. In this work, we develop a gradient-free variant of SVGD (GF-SVGD), which replaces the true gradient with a surrogate gradient, and corrects the induced bias by re-weighting the gradients in a proper form. We show that our GF-SVGD can be viewed as the standard SVGD with a special choice of kernel, and hence directly inherits the theoretical properties of SVGD. We offer insights into the empirical choice of the surrogate gradient and propose an annealed GF-SVGD that leverages the idea of simulated annealing to improve the performance on high dimensional complex distributions. Empirical studies show that our method consistently outperforms a number of recent advanced gradient-free MCMC methods.
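
For context, the standard SVGD update that this abstract builds on can be written as follows. This is the usual form of the original SVGD algorithm, given here as background rather than as the GF-SVGD update itself; the symbols (particles x_1, ..., x_n, kernel k, step size ε, target density p) are the conventional ones and are not defined in the abstract.

```latex
x_i \leftarrow x_i + \epsilon\,\hat{\phi}(x_i), \qquad
\hat{\phi}(x) = \frac{1}{n}\sum_{j=1}^{n}
  \Big[\, k(x_j, x)\,\nabla_{x_j}\log p(x_j) + \nabla_{x_j} k(x_j, x) \Big]
```

The score term \nabla_{x_j}\log p(x_j) is the quantity that must be computed from the target density, which is the requirement a gradient-free variant aims to remove.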

artificial intelligence, machine learning, particle, (17 more...)

1806.02775

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)