Goto

Collaborating Authors

 Gradient Descent


Towards Query-Efficient Black-Box Adversary with Zeroth-Order Natural Gradient Descent

arXiv.org Machine Learning

Despite the great achievements of the modern deep neural networks (DNNs), the vulnerability/robustness of state-of-the-art DNNs raises security concerns in many application domains requiring high reliability. Various adversarial attacks are proposed to sabotage the learning performance of DNN models. Among those, the black-box adversarial attack methods have received special attentions owing to their practicality and simplicity. Black-box attacks usually prefer less queries in order to maintain stealthy and low costs. However, most of the current black-box attack methods adopt the first-order gradient descent method, which may come with certain deficiencies such as relatively slow convergence and high sensitivity to hyper-parameter settings. In this paper, we propose a zeroth-order natural gradient descent (ZO-NGD) method to design the adversarial attacks, which incorporates the zeroth-order gradient estimation technique catering to the black-box attack scenario and the second-order natural gradient descent to achieve higher query efficiency. The empirical evaluations on image classification datasets demonstrate that ZO-NGD can obtain significantly lower model query complexities compared with state-of-the-art attack methods.


Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology

arXiv.org Machine Learning

A recent line of research has provided convergence guarantees for gradient descent algorithms in the excessive over-parameterization regime where the widths of all the hidden layers are required to be polynomially large in the number of training samples. However, the widths of practical deep networks are often only large in the first layer(s) and then start to decrease towards the output layer. This raises an interesting open question whether similar results also hold under this empirically relevant setting. Existing theoretical insights suggest that the loss surface of this class of networks is well-behaved, but these results usually do not provide direct algorithmic guarantees for optimization. In this paper, we close the gap by showing that one wide layer followed by pyramidal deep network topology suffices for gradient descent to find a global minimum with a geometric rate. Our proof is based on a weak form of Polyak-Lojasiewicz inequality which holds for deep pyramidal networks in the manifold of full-rank weight matrices.


Is Local SGD Better than Minibatch SGD?

arXiv.org Machine Learning

It is often important to leverage parallelism in order to tackle large scale stochastic optimization problems. A prime example is the task of minimizing the loss of machine learning models with millions or billions of parameters over enormous training sets. One popular distributed approach is local stochastic gradient descent (SGD) (Zinkevich et al., 2010; Coppola, 2015; Zhou and Cong, 2018; Stich, 2018), also known as "parallel SGD" or "Federated Averaging" 1 (McMahan et al., 2016), which is commonly applied to large scale convex and non-convex stochastic optimization problems, including in data center and "Federated Learning" settings (Kairouz et al., 2019). Local SGD uses M parallel workers which, in each of R rounds, independently execute K steps of SGD starting from a common iterate, and then communicate and average their iterates to obtain the common iterate from which the next round begins. Overall, each machine computes T KR stochastic gradients and executes KR SGD steps locally, for a total of N KRM overall stochastic gradients computed (and so N KRM samples used), with R rounds of communication (every K steps of computation).


Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks

arXiv.org Machine Learning

A fundamental question in modern machine learning is how deep neural networks can generalize. We address this question using 1) an equivalence between training infinitely wide neural networks and performing kernel regression with a deterministic kernel called the Neural Tangent Kernel (NTK) (Jacot et al. 2018), and 2) theoretical tools from statistical physics. We derive analytical expressions for learning curves for kernel regression, and use them to evaluate how the test loss of a trained neural network depends on the number of samples. Our approach allows us not only to compute the total test risk but also the decomposition of the risk due to different spectral components of the kernel. Complementary to recent results showing that during gradient descent, neural networks fit low frequency components first, we identify a new type of frequency principle: as the size of the training set size grows, kernel machines and neural networks begin to fit successively higher frequency modes of the target function. We verify our theory with simulations of kernel regression and training wide artificial neural networks.


Better Theory for SGD in the Nonconvex World

arXiv.org Machine Learning

Large-scale nonconvex optimization problems are ubiquitous in modern machine learning, and among practitioners interested in solving them, Stochastic Gradient Descent (SGD) reigns supreme. We revisit the analysis of SGD in the nonconvex setting and propose a new variant of the recently introduced expected smoothness assumption which governs the behaviour of the second moment of the stochastic gradient. We show that our assumption is both more general and more reasonable than assumptions made in all prior work. Moreover, our results yield the optimal $\mathcal{O}(\varepsilon^{-4})$ rate for finding a stationary point of nonconvex smooth functions, and recover the optimal $\mathcal{O}(\varepsilon^{-1})$ rate for finding a global solution if the Polyak-{\L}ojasiewicz condition is satisfied. We compare against convergence rates under convexity and prove a theorem on the convergence of SGD under Quadratic Functional Growth and convexity, which might be of independent interest. Moreover, we perform our analysis in a framework which allows for a detailed study of the effects of a wide array of sampling strategies and minibatch sizes for finite-sum optimization problems. We corroborate our theoretical results with experiments on real and synthetic data.


Deep Learning with Kernel Regularization for Visual Recognition

Neural Information Processing Systems

In this paper we focus on training deep neural networks for visual recognition tasks. One challenge is the lack of an informative regularization on the network parameters, to imply a meaningful control on the computed function. We propose a training strategy that takes advantage of kernel methods, where an existing kernel function represents useful prior knowledge about the learning task of interest. We derive an efficient algorithm using stochastic gradient descent, and demonstrate very positive results in a wide range of visual recognition tasks. Papers published at the Neural Information Processing Systems Conference.


Gradient Descent Meets Shift-and-Invert Preconditioning for Eigenvector Computation

Neural Information Processing Systems

Shift-and-invert preconditioning, as a classic acceleration technique for the leading eigenvector computation, has received much attention again recently, owing to fast least-squares solvers for efficiently approximating matrix inversions in power iterations. In this work, we adopt an inexact Riemannian gradient descent perspective to investigate this technique on the effect of the step-size scheme. The shift-and-inverted power method is included as a special case with adaptive step-sizes. Particularly, two other step-size settings, i.e., constant step-sizes and Barzilai-Borwein (BB) step-sizes, are examined theoretically and/or empirically. Our experimental studies show that the proposed algorithm can be significantly faster than the shift-and-inverted power method in practice.


Fast Variational Inference for Large-scale Internet Diagnosis

Neural Information Processing Systems

Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 10 4 possible faults from 10 5 observations. Further, such inference must be performed in less than a second. Inference can be done at this speed by combining a variational approximation, a mean-field approximation, and the use of stochastic gradient descent to optimize a variational cost function. We use this fast inference to diagnose a time series of anomalous HTTP requests taken from a real web service.


Topmoumoute Online Natural Gradient Algorithm

Neural Information Processing Systems

Guided by the goal of obtaining an optimization algorithm that is both fast and yielding good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets. Papers published at the Neural Information Processing Systems Conference.


Parallelized Stochastic Gradient Descent

Neural Information Processing Systems

With the increase in available data parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique --- contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime.