AITopics

Despite the great achievements of the modern deep neural networks (DNNs), the vulnerability/robustness of state-of-the-art DNNs raises security concerns in many application domains requiring high reliability. Various adversarial attacks are proposed to sabotage the learning performance of DNN models. Among those, the black-box adversarial attack methods have received special attentions owing to their practicality and simplicity. Black-box attacks usually prefer less queries in order to maintain stealthy and low costs. However, most of the current black-box attack methods adopt the first-order gradient descent method, which may come with certain deficiencies such as relatively slow convergence and high sensitivity to hyper-parameter settings. In this paper, we propose a zeroth-order natural gradient descent (ZO-NGD) method to design the adversarial attacks, which incorporates the zeroth-order gradient estimation technique catering to the black-box attack scenario and the second-order natural gradient descent to achieve higher query efficiency. The empirical evaluations on image classification datasets demonstrate that ZO-NGD can obtain significantly lower model query complexities compared with state-of-the-art attack methods.

adversarial attack, gradient, information, (15 more...)

2002.07891

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Indiana > Hamilton County > Fishers (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry:

Transportation > Air (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Nguyen, Quynh, Mondelli, Marco

Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology

A recent line of research has provided convergence guarantees for gradient descent algorithms in the excessive over-parameterization regime where the widths of all the hidden layers are required to be polynomially large in the number of training samples. However, the widths of practical deep networks are often only large in the first layer(s) and then start to decrease towards the output layer. This raises an interesting open question whether similar results also hold under this empirically relevant setting. Existing theoretical insights suggest that the loss surface of this class of networks is well-behaved, but these results usually do not provide direct algorithmic guarantees for optimization. In this paper, we close the gap by showing that one wide layer followed by pyramidal deep network topology suffices for gradient descent to find a global minimum with a geometric rate. Our proof is based on a weak form of Polyak-Lojasiewicz inequality which holds for deep pyramidal networks in the manifold of full-rank weight matrices.

deep network, global convergence, wide layer followed, (12 more...)

2002.07867

Country:

Europe > Germany > Saarland (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Germany > Rhineland-Palatinate > Kaiserslautern (0.04)
Europe > Austria (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Is Local SGD Better than Minibatch SGD?

Woodworth, Blake, Patel, Kumar Kshitij, Stich, Sebastian U., Dai, Zhen, Bullins, Brian, McMahan, H. Brendan, Shamir, Ohad, Srebro, Nathan

It is often important to leverage parallelism in order to tackle large scale stochastic optimization problems. A prime example is the task of minimizing the loss of machine learning models with millions or billions of parameters over enormous training sets. One popular distributed approach is local stochastic gradient descent (SGD) (Zinkevich et al., 2010; Coppola, 2015; Zhou and Cong, 2018; Stich, 2018), also known as "parallel SGD" or "Federated Averaging" 1 (McMahan et al., 2016), which is commonly applied to large scale convex and non-convex stochastic optimization problems, including in data center and "Federated Learning" settings (Kairouz et al., 2019). Local SGD uses M parallel workers which, in each of R rounds, independently execute K steps of SGD starting from a common iterate, and then communicate and average their iterates to obtain the common iterate from which the next round begins. Overall, each machine computes T KR stochastic gradients and executes KR SGD steps locally, for a total of N KRM overall stochastic gradients computed (and so N KRM samples used), with R rounds of communication (every K steps of computation).

local sgd, minibatch sgd, sgd, (13 more...)

2002.07839

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.96)

Bordelon, Blake, Canatar, Abdulkadir, Pehlevan, Cengiz

Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks

A fundamental question in modern machine learning is how deep neural networks can generalize. We address this question using 1) an equivalence between training infinitely wide neural networks and performing kernel regression with a deterministic kernel called the Neural Tangent Kernel (NTK) (Jacot et al. 2018), and 2) theoretical tools from statistical physics. We derive analytical expressions for learning curves for kernel regression, and use them to evaluate how the test loss of a trained neural network depends on the number of samples. Our approach allows us not only to compute the total test risk but also the decomposition of the risk due to different spectral components of the kernel. Complementary to recent results showing that during gradient descent, neural networks fit low frequency components first, we identify a new type of frequency principle: as the size of the training set size grows, kernel machines and neural networks begin to fit successively higher frequency modes of the target function. We verify our theory with simulations of kernel regression and training wide artificial neural networks.

kernel regression, neural network, regression, (13 more...)

2002.02561

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > New York (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Khaled, Ahmed, Richtárik, Peter

Better Theory for SGD in the Nonconvex World

Large-scale nonconvex optimization problems are ubiquitous in modern machine learning, and among practitioners interested in solving them, Stochastic Gradient Descent (SGD) reigns supreme. We revisit the analysis of SGD in the nonconvex setting and propose a new variant of the recently introduced expected smoothness assumption which governs the behaviour of the second moment of the stochastic gradient. We show that our assumption is both more general and more reasonable than assumptions made in all prior work. Moreover, our results yield the optimal $\mathcal{O}(\varepsilon^{-4})$ rate for finding a stationary point of nonconvex smooth functions, and recover the optimal $\mathcal{O}(\varepsilon^{-1})$ rate for finding a global solution if the Polyak-{\L}ojasiewicz condition is satisfied. We compare against convergence rates under convexity and prove a theorem on the convergence of SGD under Quadratic Functional Growth and convexity, which might be of independent interest. Moreover, we perform our analysis in a framework which allows for a detailed study of the effects of a wide array of sampling strategies and minibatch sizes for finite-sum optimization problems. We corroborate our theoretical results with experiments on real and synthetic data.

assumption, better theory, sgd, (16 more...)

2002.03329

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > New York (0.04)
(5 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.78)

Neural Information Processing SystemsFeb-15-2020, 19:56:57 GMT

Deep Learning with Kernel Regularization for Visual Recognition

Yu, Kai, Xu, Wei, Gong, Yihong

In this paper we focus on training deep neural networks for visual recognition tasks. One challenge is the lack of an informative regularization on the network parameters, to imply a meaningful control on the computed function. We propose a training strategy that takes advantage of kernel methods, where an existing kernel function represents useful prior knowledge about the learning task of interest. We derive an efficient algorithm using stochastic gradient descent, and demonstrate very positive results in a wide range of visual recognition tasks. Papers published at the Neural Information Processing Systems Conference.

deep learning, kernel regularization, visual recognition, (1 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.79)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.73)

Neural Information Processing SystemsFeb-15-2020, 19:27:52 GMT

Gradient Descent Meets Shift-and-Invert Preconditioning for Eigenvector Computation

Xu, Zhiqiang

Shift-and-invert preconditioning, as a classic acceleration technique for the leading eigenvector computation, has received much attention again recently, owing to fast least-squares solvers for efficiently approximating matrix inversions in power iterations. In this work, we adopt an inexact Riemannian gradient descent perspective to investigate this technique on the effect of the step-size scheme. The shift-and-inverted power method is included as a special case with adaptive step-sizes. Particularly, two other step-size settings, i.e., constant step-sizes and Barzilai-Borwein (BB) step-sizes, are examined theoretically and/or empirically. Our experimental studies show that the proposed algorithm can be significantly faster than the shift-and-inverted power method in practice.

eigenvector computation, gradient descent meet shift-and-invert preconditioning, shift-and-inverted power method

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Kiciman, Emre, Maltz, David, Platt, John C.

Fast Variational Inference for Large-scale Internet Diagnosis

Neural Information Processing SystemsFeb-15-2020, 05:42:26 GMT

Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 10 4 possible faults from 10 5 observations. Further, such inference must be performed in less than a second. Inference can be done at this speed by combining a variational approximation, a mean-field approximation, and the use of stochastic gradient descent to optimize a variational cost function. We use this fast inference to diagnose a time series of anomalous HTTP requests taken from a real web service.

fast variational inference, inference, large-scale internet diagnosis, (1 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Communications > Web (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Roux, Nicolas L., Manzagol, Pierre-antoine, Bengio, Yoshua

Topmoumoute Online Natural Gradient Algorithm

Neural Information Processing SystemsFeb-15-2020, 05:13:37 GMT

Guided by the goal of obtaining an optimization algorithm that is both fast and yielding good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets. Papers published at the Neural Information Processing Systems Conference.

generalization error, gradient descent, topmoumoute online natural gradient algorithm

Country: Oceania > Tonga (0.31)

Genre: Research Report (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.95)

Zinkevich, Martin, Weimer, Markus, Li, Lihong, Smola, Alex J.

Parallelized Stochastic Gradient Descent

Neural Information Processing SystemsFeb-15-2020, 04:10:44 GMT

With the increase in available data parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique --- contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime.

parallelized stochastic gradient descent, stochastic gradient descent algorithm

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)