AITopics

1910.08211

Country: North America > United States > Virginia (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
(2 more...)

Zhang, Ying, Akyildiz, Ömer Deniz, Damoulas, Theodoros, Sabanis, Sotirios

Nonasymptotic estimates for Stochastic Gradient Langevin Dynamics under local conditions in nonconvex optimization

arXiv.org Machine LearningOct-17-2019

Within the context of empirical risk minimization, see Raginsky, Rakhlin, and Telgarsky (2017), we are concerned with a non-asymptotic analysis of sampling algorithms used in optimization. In particular, we obtain non-asymptotic error bounds for a popular class of algorithms called Stochastic Gradient Langevin Dynamics (SGLD). These results are derived in appropriate Wasserstein distances in the absence of the log-concavity of the target distribution. More precisely, the local Lipschitzness of the stochastic gradient $H(\theta, x)$ is assumed, and furthermore, the dissipativity and convexity at infinity condition are relaxed by removing the uniform dependence in $x$.

algorithm, assumption 1, lemma 3, (12 more...)

1910.02008

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.81)

#artificialintelligenceOct-16-2019, 11:10:55 GMT

#003C Gradient Descent in Python Master Data Science

We will first import libraries as NumPy, matplotlib, pyplot and derivative function. Then with a NumPy function – linspace() we define our variable $w $ domain between 1.0 and 5.0 and 100 points. Also we define alpha which will represent learning rate. Next, we will define our $y $ ( in our case $J(w) $) and plot to see a convex function, we will use $(w-3) 2 $. So we can see that we plotted our convex function as an example.

algorithm, gradient descent, python master data science, (3 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)

arXiv.org Machine LearningOct-16-2019

A Double Residual Compression Algorithm for Efficient Distributed Learning

Liu, Xiaorui, Li, Yao, Tang, Jiliang, Yan, Ming

Large-scale machine learning models are often trained by parallel stochastic gradient descent algorithms. However, the communication cost of gradient aggregation and model synchronization between the master and worker nodes becomes the major obstacle for efficient learning as the number of workers and the dimension of the model increase. In this paper, we propose DORE, a DOuble REsidual compression stochastic gradient descent algorithm, to reduce over $95\%$ of the overall communication such that the obstacle can be immensely mitigated. Our theoretical analyses demonstrate that the proposed strategy has superior convergence properties for both strongly convex and nonconvex objective functions. The experimental results validate that DORE achieves the best communication efficiency while maintaining similar model accuracy and convergence speed in comparison with start-of-the-art baselines.

compression, doublesqueeze, gradient, (13 more...)

1910.07561

Country:

Europe > Sweden > Stockholm > Stockholm (0.04)
Asia > Middle East > Jordan (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(6 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.96)

Wang, Yuanhao, Zhang, Guodong, Ba, Jimmy

On Solving Minimax Optimization Locally: A Follow-the-Ridge Approach

arXiv.org Machine LearningOct-16-2019

Many tasks in modern machine learning can be formulated as finding equilibria in \emph{sequential} games. In particular, two-player zero-sum sequential games, also known as minimax optimization, have received growing interest. It is tempting to apply gradient descent to solve minimax optimization given its popularity and success in supervised learning. However, it has been noted that naive application of gradient descent fails to find some local minimax and can converge to non-local-minimax points. In this paper, we propose \emph{Follow-the-Ridge} (FR), a novel algorithm that provably converges to and only converges to local minimax. We show theoretically that the algorithm addresses the notorious rotational behaviour of gradient dynamics, and is compatible with preconditioning and \emph{positive} momentum. Empirically, FR solves toy minimax problems and improves the convergence of GAN training compared to the recent minimax optimization algorithms.

eigenvalue, local minimax, minimax, (15 more...)

1910.07512

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

Orthogonal Gradient Descent for Continual Learning

Farajtabar, Mehrdad, Azizan, Navid, Mott, Alex, Li, Ang

Orthogonal Gradient Descent for Continual LearningMehrdad Farajtabar Navid Azizan 1 Alex Mott Ang Li DeepMind CalTech DeepMind DeepMind Abstract Neural networks are achieving state of the art and sometimes superhuman performance on learning tasks across a variety of domains. Whenever these problems require learning in a continual or sequential manner, however, neural networks suffer from the problem of catastrophic forgetting; they forget how to solve previous tasks after being trained on a new task, despite having the essential capacity to solve both tasks if they were trained on both simultaneously. In this paper, we propose to address this issue from a parameter space perspective and study an approach to restrict the direction of the gradient updates to avoid forgetting previously-learned data. We present the Orthogonal Gradient Descent (OGD) method, which accomplishes this goal by projecting the gradients from new tasks onto a subspace in which the neural network output on previous task does not change and the projected gradient is still in a useful direction for learning the new task. Our approach utilizes the high capacity of a neural network more efficiently and does not require storing the previously learned data that might raise privacy concerns. Experiments on common benchmarks reveal the effectiveness of the proposed OGD method. 1 Introduction One critical component of intelligence is the ability to learn continuously, when new information is constantly available but previously presented information is unavailable to retrieve. Despite their ubiquity in the real world, these problems have posed a longstanding challenge to artificial intelligence (Thrun and Mitchell, 1995; Hassabis et al., 2017).Correspondence to farajtabar@google.com. 1 Work done during an internship at DeepMind. A typical neural network training procedure over a sequence of different tasks usually results in degraded performance on previously trained tasks if the model could not revisit the data of previous tasks. This phenomenon is called catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990; French, 1999). Ideally, an intelligent agent should be able to learn consecutive tasks without degrading its performance on those already learned. With the deep learning renaissance (Krizhevsky et al., 2012; Hinton et al., 2006; Si-monyan and Zisserman, 2014) this problem has been revived (Srivastava et al., 2013; Goodfellow et al., 2013) with many followup studies (Parisi et al., 2019). One probable reason for this phenomenon is that neural networks are usually trained by Stochastic Gradient Descent (SGD)--or its variants--where the op-timizers produce gradients that are oblivious to past knowledge.

arxiv preprint arxiv, gradient, learning, (13 more...)

1910.07104

Genre: Research Report > New Finding (0.46)

Industry:

Education (0.68)
Health & Medicine (0.46)
Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Lei, Qi, Lee, Jason D., Dimakis, Alexandros G., Daskalakis, Constantinos

SGD Learns One-Layer Networks in WGANs

Generative adversarial networks (GANs) are a widely used framework for learning generative models. Wasserstein GANs (WGANs), one of the most successful variants of GANs, require solving a minmax optimization problem to global optimality, but are in practice successfully trained using stochastic gradient descent-ascent. In this paper, we show that, when the generator is a one-layer network, stochastic gradient descent-ascent converges to a global solution with polynomial time and sample complexity.

activation, diag, stationary point, (14 more...)

1910.0703

Country:

North America > United States > Texas > Travis County > Austin (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.97)

He, Hangfeng, Su, Weijie J.

The Local Elasticity of Neural Networks

This paper presents a phenomenon in neural networks that we refer to as \textit{local elasticity}. Roughly speaking, a classifier is said to be locally elastic if its prediction at a feature vector $\bx'$ is \textit{not} significantly perturbed, after the classifier is updated via stochastic gradient descent at a (labeled) feature vector $\bx$ that is \textit{dissimilar} to $\bx'$ in a certain sense. This phenomenon is shown to persist for neural networks with nonlinear activation functions through extensive simulations on real-life and synthetic datasets, whereas this is not observed in linear classifiers. In addition, we offer a geometric interpretation of local elasticity using the neural tangent kernel \citep{jacot2018neural}. Building on top of local elasticity, we obtain pairwise similarity measures between feature vectors, which can be used for clustering in conjunction with $K$-means. The effectiveness of the clustering algorithm on the MNIST and CIFAR-10 datasets in turn corroborates the hypothesis of local elasticity of neural networks on real-life data. Finally, we discuss some implications of local elasticity to shed light on several intriguing aspects of deep neural networks.

local elasticity, neural network, similarity, (13 more...)

1910.06943

Country: North America > United States > Pennsylvania (0.04)

Genre:

Research Report > New Finding (0.46)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Extracting robust and accurate features via a robust information bottleneck

Pensia, Ankit, Jog, Varun, Loh, Po-Ling

We propose a novel strategy for extracting features in supervised learning that can be used to construct a classifier which is more robust to small perturbations in the input space. Our method builds upon the idea of the information bottleneck by introducing an additional penalty term that encourages the Fisher information of the extracted features to be small, when parametrized by the inputs. By tuning the regularization parameter, we can explicitly trade off the opposing desiderata of robustness and accuracy when constructing a classifier. We derive the optimal solution to the robust information bottleneck when the inputs and outputs are jointly Gaussian, proving that the optimally robust features are also jointly Gaussian in that setting. Furthermore, we propose a method for optimizing a variational bound on the robust information bottleneck objective in general settings using stochastic gradient descent, which may be implemented efficiently in neural networks. Our experimental results for synthetic and real data sets show that the proposed feature extraction method indeed produces classifiers with increased robustness to perturbations.

inequality, information, robustness, (16 more...)

1910.06893

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
(2 more...)

On Higher-order Moments in Adam

Jiang, Zhanhong, Balu, Aditya, Tan, Sin Yong, Lee, Young M, Hegde, Chinmay, Sarkar, Soumik

In this paper, we investigate the popular deep learning optimization routine, Adam, from the perspective of statistical moments. While Adam is an adaptive lower-order moment based (of the stochastic gradient) method, we propose an extension namely, HAdam, which uses higher order moments of the stochastic gradient. Our analysis and experiments reveal that certain higher-order moments of the stochastic gradient are able to achieve better performance compared to the vanilla Adam algorithm. We also provide some analysis of HAdam related to odd and even moments to explain some intriguing and seemingly non-intuitive empirical results.

gradient, hadam, stochastic gradient, (14 more...)

1910.06878

Country:

North America > United States > Iowa (0.05)
North America > Canada (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.80)