Gradient Descent
Projected Stein Variational Gradient Descent
The curse of dimensionality is a critical challenge in Bayesian inference for high dimensional parameters. In this work, we address this challenge by developing a projected Stein variational gradient descent (pSVGD) method, which projects the parameters into a subspace that is adaptively constructed using the gradient of the log-likelihood, and applies SVGD for the much lower-dimensional coefficients of the projection. We provide an upper bound for the projection error with respect to the posterior and demonstrate the accuracy (compared to SVGD) and scalability of pSVGD with respect to the number of parameters, samples, data points, and processor cores.
Momentum Improves Normalized SGD
We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. Then, we consider the case of objectives with bounded second derivative and show that in this case a small tweak to the momentum formula allows normalized SGD with momentum to find an ษ-critical point in O(1/ษ 3.5) iterations, matching the best-known rates without accruing any logarithmic factors or dependence on dimension. We also provide an adaptive method that automatically improves convergence rates when the variance in the gradients is small. Finally, we show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining, matching the performance of the disparate methods used to get state-of-the-art results on both tasks. The rise of deep learning has focused research attention on the problem of solving optimization problems that are high-dimensional, large-scale, and non-convex. Modern neural networks can have billions of parameters (high-dimensional) Raffel et al. (2019); Shazeer et al. (2017), are trained using datasets containing millions of examples (large scale) Deng et al. (2009) on objective functions that are non-convex. Because of these considerations, stochastic gradient descent (SGD) has emerged as the de-facto method-of-choice for training deep models.
Extended Stochastic Gradient MCMC for Large-Scale Bayesian Variable Selection
Song, Qifan, Sun, Yan, Ye, Mao, Liang, Faming
Stochastic gradient Markov chain Monte Carlo (MCMC) algorithms have received much attention in Bayesian computing for big data problems, but they are only applicable to a small class of problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. This paper proposes an extended stochastic gradient MCMC lgoriathm which, by introducing appropriate latent variables, can be applied to more general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. Numerical studies show that the proposed algorithm is highly scalable and much more efficient than traditional MCMC algorithms. The proposed algorithms have much alleviated the pain of Bayesian methods in big data computing.
Achieving the fundamental convergence-communication tradeoff with Differentially Quantized Gradient Descent
Lin, Chung-Yi, Kostina, Victoria, Hassibi, Babak
The problem of reducing the communication cost in distributed training through gradient quantization is considered. For the class of smooth and strongly convex objective functions, we characterize the minimum achievable linear convergence rate for a given number of bits per problem dimension $n$. We propose Differentially Quantized Gradient Descent, a quantization algorithm with error compensation, and prove that it achieves the fundamental tradeoff between communication rate and convergence rate as $n$ goes to infinity. In contrast, the naive quantizer that compresses the current gradient directly fails to achieve that optimal tradeoff. Experimental results on both simulated and real-world least-squares problems confirm our theoretical analysis.
Faster On-Device Training Using New Federated Momentum Algorithm
Huo, Zhouyuan, Yang, Qian, Gu, Bin, Huang, Lawrence Carin. Heng
Mobile crowdsensing has gained significant attention in recent years and has become a critical paradigm for emerging Internet of Things applications. The sensing devices continuously generate a significant quantity of data, which provide tremendous opportunities to develop innovative intelligent applications. To utilize these data to train machine learning models while not compromising user privacy, federated learning has become a promising solution. However, there is little understanding of whether federated learning algorithms are guaranteed to converge. We reconsider model averaging in federated learning and formulate it as a gradient-based method with biased gradients. This novel perspective assists analysis of its convergence rate and provides a new direction for more acceleration. We prove for the first time that the federated averaging algorithm is guaranteed to converge for non-convex problems, without imposing additional assumptions. We further propose a novel accelerated federated learning algorithm and provide a convergence guarantee. Simulated federated learning experiments are conducted to train deep neural networks on benchmark datasets, and experimental results show that our proposed method converges faster than previous approaches.
Entropy Minimization vs. Diversity Maximization for Domain Adaptation
Wu, Xiaofu, hang, Suofei, Zhou, Quan, Yang, Zhen, Zhao, Chunming, Latecki, Longin Jan
Entropy minimization has been widely used in unsupervised domain adaptation (UDA). However, existing works reveal that entropy minimization only may result into collapsed trivial solutions. In this paper, we propose to avoid trivial solutions by further introducing diversity maximization. In order to achieve the possible minimum target risk for UDA, we show that diversity maximization should be elaborately balanced with entropy minimization, the degree of which can be finely controlled with the use of deep embedded validation in an unsupervised manner. The proposed minimal-entropy diversity maximization (MEDM) can be directly implemented by stochastic gradient descent without use of adversarial learning. Empirical evidence demonstrates that MEDM outperforms the state-of-the-art methods on four popular domain adaptation datasets.
Minimax Defense against Gradient-based Adversarial Attacks
Lindqvist, Blerta, Izmailov, Rauf
State-of-the-art adversarial attacks are aimed at neural network classifiers. By default, neural networks use gradient descent to minimize their loss function. The gradient of a classifier's loss function is used by gradient-based adversarial attacks to generate adversarially perturbed images. We pose the question whether another type of optimization could give neural network classifiers an edge. Here, we introduce a novel approach that uses minimax optimization to foil gradient-based adversarial attacks. Our minimax classifier is the discriminator of a generative adversarial network (GAN) that plays a minimax game with the GAN generator. In addition, our GAN generator projects all points onto a manifold that is different from the original manifold since the original manifold might be the cause of adversarial attacks. To measure the performance of our minimax defense, we use adversarial attacks - Carlini Wagner (CW), DeepFool, Fast Gradient Sign Method (FGSM) - on three datasets: MNIST, CIFAR-10 and German Traffic Sign (TRAFFIC). Against CW attacks, our minimax defense achieves 98.07% (MNIST-default 98.93%), 73.90% (CIFAR-10-default 83.14%) and 94.54% (TRAFFIC-default 96.97%). Against DeepFool attacks, our minimax defense achieves 98.87% (MNIST), 76.61% (CIFAR-10) and 94.57% (TRAFFIC). Against FGSM attacks, we achieve 97.01% (MNIST), 76.79% (CIFAR-10) and 81.41% (TRAFFIC). Our Minimax adversarial approach presents a significant shift in defense strategy for neural network classifiers.
Newton-ADMM: A Distributed GPU-Accelerated Optimizer for Multiclass Classification Problems
Fang, Chih-Hao, Kylasa, Sudhir B, Roosta, Fred, Mahoney, Michael W., Grama, Ananth
First-order optimization methods, such as stochastic gradient descent (SGD) and its variants, are widely used in machine learning applications due to their simplicity and low per-iteration costs. However, they often require larger numbers of iterations, with associated communication costs in distributed environments. In contrast, Newton-type methods, while having higher per-iteration costs, typically require a significantly smaller number of iterations, which directly translates to reduced communication costs. In this paper, we present a novel distributed optimizer for classification problems, which integrates a GPU-accelerated Newton-type solver with the global consensus formulation of Alternating Direction of Method Multipliers (ADMM). By leveraging the communication efficiency of ADMM, GPU-accelerated inexact-Newton solver, and an effective spectral penalty parameter selection strategy, we show that our proposed method (i) yields better generalization performance on several classification problems; (ii) significantly outperforms state-of-the-art methods in distributed time to solution; and (iii) offers better scaling on large distributed platforms.
Oracle lower bounds for stochastic gradient sampling algorithms
Chatterji, Niladri S., Bartlett, Peter L., Long, Philip M.
We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^{d}$, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed. Several popular sampling algorithms (including many Markov chain Monte Carlo methods) operate by using stochastic gradients of the log density to generate a sample; our results establish an information theoretic limit for all these algorithms. We show that for every algorithm, there exists a well-conditioned strongly log-concave target density for which the distribution of points generated by the algorithm would be at least $\varepsilon$ away from the target in total variation distance if the number of gradient queries is less than $\Omega(\sigma^2 d/\varepsilon^2)$, where $\sigma^2 d$ is the variance of the stochastic gradient. Our lower bound follows by combining the ideas of Le Cam deficiency routinely used in the comparison of statistical experiments along with standard information theoretic tools used in lower bounding Bayes risk functions. To the best of our knowledge our results provide the first nontrivial dimension-dependent lower bound for this problem.