Gradient Descent
On the Intrinsic Privacy of Stochastic Gradient Descent
Hyland, Stephanie L., Tople, Shruti
Stephanie L. Hyland (Microsoft Research), Shruti Tople (Microsoft Research). Abstract: Protecting the privacy of training data is important for the safe deployment of machine learning models. Private learning algorithms have been proposed that ensure strong differential-privacy (DP) guarantees. However, the additional noise required for such protection comes at the cost of reduced model utility. Meanwhile, the stochastic gradient descent (SGD) method -- the most common optimization algorithm for neural networks -- contains intrinsic randomness which has not been leveraged for privacy. Arguing that SGD guarantees intrinsic privacy, we investigate the extent to which this privacy can be quantified and used to improve the utility of privately learned models. In effect, we ask the question: "If SGD were a differentially-private mechanism, how good would it be?" In this work, we take the first step towards analysing the intrinsic privacy properties of SGD. Our primary contribution is a large-scale empirical analysis of SGD on both convex and non-convex objectives. To this end, we evaluate the inherent variability due to the stochasticity in SGD on 3 different datasets and calculate the $\epsilon$ values due to the intrinsic noise. First, we show that the variability in model parameters due to the random sampling almost always exceeds that due to changes in the data. We observe that SGD provides intrinsic $\epsilon$ values of 7.8, 6.9 Next, we propose a method to augment the intrinsic noise of SGD with additional noise to achieve the desired $\epsilon$. Our augmented SGD outputs models that outperform existing approaches with the same privacy guarantee, thus closing the gap to noiseless utility between 0 . Finally, we show that the existing theoretical bound on the sensitivity of SGD is not tight. By estimating the tightest bound empirically, we achieve near-noiseless performance at $\epsilon = 1$, closing the utility gap to the noiseless model between 3 .
Our experiments provide concrete evidence that changing the seed in SGD is likely to have a far greater impact on the resulting model than including or excluding any given training example. By properly accounting for this intrinsic randomness, higher utility can be achieved without sacrificing further privacy. With these results, we hope to inspire the research community to further explore and characterise the randomness in SGD, its impact on privacy, and the parallels with generalisation in machine learning.
INTRODUCTION
Respecting the privacy of users contributing their data to train machine learning models is important.
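The comparison described in this abstract (variability from the SGD seed versus variability from a change in the data) can be illustrated with a small sketch. Everything below is an assumption made for illustration and not the authors' experimental setup: the synthetic data, the logistic-regression model, the hyperparameters, and the helper name `sgd_logreg` are all hypothetical. The neighbouring dataset replaces one example (rather than removing it) so that the same seed yields the same minibatch order.

```python
import itertools
import numpy as np

def sgd_logreg(X, y, seed, epochs=5, lr=0.1, batch=8):
    """Minibatch SGD for logistic regression; the seed controls only the shuffling."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            p = 1.0 / (1.0 + np.exp(-X[idx] @ w))      # predicted probabilities
            w -= lr * X[idx].T @ (p - y[idx]) / len(idx)
    return w

# Synthetic dataset D (an assumption of this sketch).
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(200, 5))
w_true = data_rng.normal(size=5)
y = (X @ w_true + 0.5 * data_rng.normal(size=200) > 0).astype(float)

# Neighbouring dataset D': identical except that example 0 is replaced.
X_nb, y_nb = X.copy(), y.copy()
X_nb[0] = data_rng.normal(size=5)
y_nb[0] = 1.0 - y[0]

seeds = range(5)
w_D = [sgd_logreg(X, y, s) for s in seeds]
w_Dnb = [sgd_logreg(X_nb, y_nb, s) for s in seeds]

# Same data, different seeds: average pairwise distance between final weights.
seed_variability = np.mean(
    [np.linalg.norm(a - b) for a, b in itertools.combinations(w_D, 2)]
)
# Same seed, neighbouring data: average distance between final weights.
data_variability = np.mean(
    [np.linalg.norm(a - b) for a, b in zip(w_D, w_Dnb)]
)
print("seed variability:", seed_variability)
print("data variability:", data_variability)
```

The two printed numbers can then be compared directly; which one dominates on a toy problem like this depends on the chosen hyperparameters and says nothing about the paper's reported results.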
r/MachineLearning - [D] The gradient descent renaissance
The field of machine learning underwent massive changes in the 2010s. At the beginning, the field saw diverse approaches applied to a variety of topics and data structures. Then AlexNet blew away the competition in the ImageNet challenge with its CNN, and the field was forever changed. However, there was a warming-up phase. Caffe's first release was in 2013.
Lower Bounds for Non-Convex Stochastic Optimization
Arjevani, Yossi, Carmon, Yair, Duchi, John C., Foster, Dylan J., Srebro, Nathan, Woodworth, Blake
We lower bound the complexity of finding $\epsilon$-stationary points (with gradient norm at most $\epsilon$) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least $\epsilon^{-4}$ queries to find an $\epsilon$-stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of $\epsilon^{-3}$ queries, establishing the optimality of recently proposed variance reduction techniques.
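For readers who want the setting of this abstract in symbols, one common way to write it is sketched below. The notation ($F$ for the objective, $g(x, z)$ for the oracle's answer at point $x$ with random seed $z$, $\sigma^2$ for the variance bound, $\bar{L}$ for the mean-squared smoothness constant) is an assumption of this sketch and may differ from the paper's. The oracle is unbiased with bounded variance:

$$
\mathbb{E}_z\,[g(x, z)] = \nabla F(x), \qquad \mathbb{E}_z\,\|g(x, z) - \nabla F(x)\|^2 \le \sigma^2 ,
$$

and an output $\hat{x}$ is $\epsilon$-stationary when

$$
\mathbb{E}\,\|\nabla F(\hat{x})\| \le \epsilon .
$$

The more restrictive model additionally assumes mean-squared smoothness of the oracle:

$$
\mathbb{E}_z\,\|g(x, z) - g(y, z)\|^2 \le \bar{L}^2\,\|x - y\|^2 \quad \text{for all } x, y .
$$

Under the first pair of conditions the abstract's lower bound scales as $\epsilon^{-4}$ queries; with the additional smoothness condition it scales as $\epsilon^{-3}$.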
Quantum-Inspired Hamiltonian Monte Carlo for Bayesian Sampling
Hamiltonian Monte Carlo (HMC) is an efficient Bayesian sampling method that can make distant proposals in the parameter space by simulating a Hamiltonian dynamical system. Despite its popularity in machine learning and data science, HMC is inefficient at sampling from spiky and multimodal distributions. Motivated by the energy-time uncertainty relation from quantum mechanics, we propose a Quantum-Inspired Hamiltonian Monte Carlo algorithm (QHMC). This algorithm allows a particle to have a random mass with a probability distribution rather than a fixed mass. We prove the convergence property of QHMC in the spatial domain and in the time sequence. We further show why such a random mass can improve the performance when we sample a broad class of distributions. In order to handle the big training data sets in large-scale machine learning, we develop a stochastic gradient version of QHMC using a Nosé-Hoover thermostat, called QSGNHT, and we also provide theoretical justifications about its steady-state distributions. Finally, in the experiments, we demonstrate the effectiveness of QHMC and QSGNHT on synthetic examples, bridge regression, image denoising and neural network pruning. The proposed QHMC and QSGNHT can indeed achieve much more stable and accurate sampling results on the test cases.
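The core idea in this abstract, redrawing the particle's mass from a distribution before each trajectory, can be sketched as a small modification of textbook HMC. This is a minimal illustration under stated assumptions, not the authors' algorithm: the scalar log-normal mass (`10 ** Normal(mu_m, sigma_m)`), the Metropolis correction, the default parameters, and the function name `random_mass_hmc` are all choices made here.

```python
import numpy as np

def random_mass_hmc(u, grad_u, x0, n_samples, step=0.1, n_leapfrog=20,
                    mu_m=0.0, sigma_m=0.5, seed=0):
    """HMC where a scalar mass is resampled before every trajectory.

    u(x) is the potential energy (negative log density up to a constant)
    and grad_u(x) its gradient.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    for t in range(n_samples):
        m = 10.0 ** rng.normal(mu_m, sigma_m)        # random mass (assumed log-normal)
        p = rng.normal(size=x.shape) * np.sqrt(m)    # momentum ~ N(0, m I)
        x_new, p_new = x.copy(), p.copy()

        # Leapfrog integration of the Hamiltonian dynamics with mass m.
        p_new -= 0.5 * step * grad_u(x_new)
        for i in range(n_leapfrog):
            x_new += step * p_new / m
            if i < n_leapfrog - 1:
                p_new -= step * grad_u(x_new)
        p_new -= 0.5 * step * grad_u(x_new)

        # Metropolis accept/reject on the total energy.
        h_old = u(x) + 0.5 * np.sum(p ** 2) / m
        h_new = u(x_new) + 0.5 * np.sum(p_new ** 2) / m
        if np.log(rng.uniform()) < h_old - h_new:
            x = x_new
        samples[t] = x
    return samples

# Usage: sample a 1-D standard normal, whose potential is u(x) = x^2 / 2.
samples = random_mass_hmc(lambda x: 0.5 * np.sum(x ** 2), lambda x: x,
                          x0=[3.0], n_samples=2000)
print(samples[200:].mean(), samples[200:].std())
```

A small mass gives the particle a long effective step and a large mass a short one, so drawing the mass at random mixes trajectory scales within one chain.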
A Quasi-Newton Method Based Vertical Federated Learning Framework for Logistic Regression
Yang, Kai, Fan, Tao, Chen, Tianjian, Shi, Yuanming, Yang, Qiang
Data privacy and security have become a major concern in building machine learning models from different data providers. Federated learning shows promise by leaving data at providers locally and exchanging encrypted information. This paper studies the vertical federated learning structure for logistic regression where the data sets at two parties have the same sample IDs but own disjoint subsets of features. Existing frameworks adopt the first-order stochastic gradient descent algorithm, which requires a large number of communication rounds. To address the communication challenge, we propose a quasi-Newton method based vertical federated learning framework for logistic regression under the additively homomorphic encryption scheme. Our approach can considerably reduce the number of communication rounds at a small additional communication cost per round. Numerical results demonstrate the advantages of our approach over the first-order method.
On the geometry of Stein variational gradient descent
Duncan, A., Nuesken, N., Szpruch, L.
Bayesian inference problems require sampling or approximating high-dimensional probability distributions. The focus of this paper is on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems, the mean-field limit of which is a gradient flow on the space of probability distributions equipped with a certain geometrical structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to consider certain nondifferentiable kernels with adjusted tails. We demonstrate significant performance gains from these in various numerical experiments.
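The interacting particle system mentioned in this abstract can be made concrete with the standard Stein variational gradient descent update. The sketch below assumes a Gaussian (RBF) kernel with a fixed bandwidth `h`, which is the usual textbook choice and not the nondifferentiable kernels the paper studies; the function name `svgd_step` and the parameter values are hypothetical.

```python
import numpy as np

def svgd_step(x, grad_log_p, step=0.1, h=1.0):
    """One SVGD update for particles x of shape (n, d) with an RBF kernel."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]               # diff[j, i] = x_j - x_i
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))   # k[j, i] = k(x_j, x_i)
    grad_k = -diff / h ** 2 * k[:, :, None]            # gradient of k(x_j, x_i) w.r.t. x_j
    # Attractive term (kernel-weighted scores) plus repulsive term (kernel gradients).
    phi = (k.T @ grad_log_p(x) + grad_k.sum(axis=0)) / n
    return x + step * phi

# Usage: move particles initialised near 5 towards a standard normal target,
# whose score function is grad log p(x) = -x.
rng = np.random.default_rng(0)
particles = rng.normal(5.0, 1.0, size=(50, 1))
for _ in range(500):
    particles = svgd_step(particles, lambda x: -x)
print(particles.mean(), particles.std())
```

The first term pulls every particle towards high-density regions, while the kernel-gradient term pushes particles apart so they do not collapse onto the mode. The kernel choice controls both effects, which is why the abstract singles it out.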
LOGAN: Latent Optimisation for Generative Adversarial Networks
Wu, Yan, Donahue, Jeff, Balduzzi, David, Simonyan, Karen, Lillicrap, Timothy
Training generative adversarial networks requires balancing of delicate adversarial dynamics. Even with careful tuning, training may diverge or end up in a bad equilibrium with dropped modes. In this work, we introduce a new form of latent optimisation inspired by the CS-GAN and show that it improves adversarial dynamics by enhancing interactions between the discriminator and the generator. We develop supporting theoretical analysis from the perspectives of differentiable games and stochastic approximation. Our experiments demonstrate that latent optimisation can significantly improve GAN training, obtaining state-of-the-art performance for the ImageNet ($128 \times 128$) dataset. Our model achieves an Inception Score (IS) of 148 and a Fréchet Inception Distance (FID) of 3.4, an improvement of 17% and 32% in IS and FID respectively, compared with the baseline BigGAN-deep model with the same architecture and number of parameters. Generative Adversarial Nets (GANs) are implicit generative models that can be trained to match a given data distribution. GANs were originally developed by Goodfellow et al. (2014) for image data. As the field of generative modelling has advanced, GANs remain at the frontier, generating high-fidelity images at large scale (Brock et al., 2018).
Efficient Relaxed Gradient Support Pursuit for Sparsity Constrained Non-convex Optimization
Shang, Fanhua, Wei, Bingkun, Liu, Hongying, Liu, Yuanyuan, Zhuo, Jiacheng
Large-scale non-convex sparsity-constrained problems have recently gained extensive attention. Most existing deterministic optimization methods (e.g., GraSP) are not suitable for large-scale and high-dimensional problems, and thus stochastic optimization methods with hard thresholding (e.g., SVRGHT) become more attractive. Inspired by GraSP, this paper proposes a new general relaxed gradient support pursuit (RGraSP) framework, in which the sub-algorithm is only required to satisfy a slack descent condition. We also design two specific semi-stochastic gradient hard thresholding algorithms. In particular, our algorithms use far fewer hard thresholding operations than SVRGHT, and their average per-iteration cost is much lower (i.e., O(d) vs. O(d log(d)) for SVRGHT), which leads to faster convergence. Our experimental results on both synthetic and real-world datasets show that our algorithms are superior to the state-of-the-art gradient hard thresholding methods.
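The hard thresholding operation that this abstract counts can be sketched as follows, together with a plain (deterministic) gradient hard thresholding loop for sparse least squares. This is a generic illustration and not RGraSP or SVRGHT: the least-squares objective, the step-size rule, and the names `hard_threshold` and `gradient_hard_thresholding` are assumptions of this sketch.

```python
import numpy as np

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    if s <= 0:
        return out
    if s >= x.size:
        return x.copy()
    # argpartition selects the top-s indices without fully sorting the vector.
    keep = np.argpartition(np.abs(x), -s)[-s:]
    out[keep] = x[keep]
    return out

def gradient_hard_thresholding(A, b, s, n_iter=200):
    """Minimise 0.5 * ||A x - b||^2 subject to x having at most s nonzeros."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / (largest singular value)^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        x = hard_threshold(x - step * grad, s)  # gradient step, then projection
    return x

# Usage on a synthetic 3-sparse problem.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 30))
x_true = np.zeros(30)
x_true[[2, 11, 25]] = [1.5, -2.0, 0.8]
b = A @ x_true
x_hat = gradient_hard_thresholding(A, b, s=3)
print(np.flatnonzero(x_hat))
```

This loop thresholds once per gradient step; the abstract's claim concerns algorithms that need this operation far less often.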
Risk Bounds for Low Cost Bipartite Ranking
Bipartite ranking is an important supervised learning problem; however, unlike regression or classification, it has a quadratic dependence on the number of samples. To circumvent the prohibitive sample cost, many recent works focus on stochastic gradient-based methods. In this paper we consider an alternative approach, which leverages the structure of the widely-adopted pairwise squared loss, to obtain a stochastic and low cost algorithm that does not require stochastic gradients or learning rates. Using a novel uniform risk bound---based on matrix and vector concentration inequalities---we show that the sample size required for competitive performance against the all-pairs batch algorithm does not have a quadratic dependence. Generalization bounds for both the batch and low cost stochastic algorithms are presented. Experimental results show a significant speed gain against the batch algorithm, as well as competitive performance against state-of-the-art bipartite ranking algorithms on real datasets.
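One way to see why the pairwise squared loss has exploitable structure is that its average over all positive-negative pairs can be written in terms of first and second moments of the scores, so it never requires enumerating the pairs. The sketch below checks this identity numerically. It is an illustration only, not the paper's algorithm; the margin of 1 in the loss and the function names are assumptions.

```python
import numpy as np

def pairwise_sq_loss_naive(pos, neg):
    """Average of (1 - (a - b))^2 over all pairs (a in pos, b in neg): O(n+ * n-)."""
    diff = pos[:, None] - neg[None, :]
    return np.mean((1.0 - diff) ** 2)

def pairwise_sq_loss_moments(pos, neg):
    """Same quantity from score moments only: O(n+ + n-).

    Expands (1 - a + b)^2 = 1 + a^2 + b^2 - 2a + 2b - 2ab and uses the fact
    that the average of a*b over all pairs equals mean(a) * mean(b).
    """
    return (1.0 + np.mean(pos ** 2) + np.mean(neg ** 2)
            - 2.0 * np.mean(pos) + 2.0 * np.mean(neg)
            - 2.0 * np.mean(pos) * np.mean(neg))

# Usage: scores of positive and negative examples from some scoring function.
rng = np.random.default_rng(0)
pos_scores = rng.normal(1.0, 1.0, size=300)
neg_scores = rng.normal(-1.0, 1.0, size=500)
naive = pairwise_sq_loss_naive(pos_scores, neg_scores)
fast = pairwise_sq_loss_moments(pos_scores, neg_scores)
print(naive, fast)
```

For a linear scoring function the same expansion turns the all-pairs objective into a quadratic in the weight vector built from class-wise means and second-moment matrices.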
Announcement Regarding Successful Development of Gradient Descent (Backpropagation) Algorithm for Quantum Computers
Quantum computing has received significant attention as a next-generation computing technology due to its potential speed and ability to solve problems considered too difficult for classical computers, as reflected in the recent discussion on Quantum Supremacy. Grid sees quantum computing not only as a tool for solving optimization and quantum chemical computation problems, but also as a tool for AI (Machine Learning, Deep Learning, etc.) calculations, such as feature extraction. Previous works have announced the successful implementation of machine learning-related algorithms, such as principal component analysis and auto-encoders, on quantum computers. This work announces the development of a gradient descent (backpropagation) algorithm, a method commonly used in machine learning for neural network parameter optimization, for use on NISQ quantum computers. Due to the non-linear nature of quantum bits (qubits), Grid proposes that this algorithm can be used to perform the feature extraction and representation calculations that deep learning methods employ.