Goto

Collaborating Authors

 Gradient Descent


DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM

arXiv.org Machine Learning

Machine learning (ML) models trained by differentially private stochastic gradient descent (DP-SGD) has much lower utility than the non-private ones. To mitigate this degradation, we propose a DP Laplacian smoothing SGD (DP-LSSGD) for privacy-preserving ML. At the core of DP-LSSGD is the Laplace smoothing operator, which smooths out the Gaussian noise vector used in the Gaussian mechanism. Under the same amount of noise used in the Gaussian mechanism, DP-LSSGD attains the same differential privacy guarantee, but a strictly better utility guarantee, excluding an intrinsic term which is usually dominated by the other terms, for convex optimization than DP-SGD by a factor which is much less than one. In practice, DP-LSSGD makes training both convex and nonconvex ML models more efficient and enables the trained models to generalize better. For ResNet20, under the same strong differential privacy guarantee, DP-LSSGD can lift the testing accuracy of the trained private model by more than $8$\% compared with DP-SGD. The proposed algorithm is simple to implement and the extra computational complexity and memory overhead compared with DP-SGD are negligible. DP-LSSGD is applicable to train a large variety of ML models, including deep neural nets. The code is available at \url{https://github.com/BaoWangMath/DP-LSSGD}.


Neural ODEs as the Deep Limit of ResNets with constant weights

arXiv.org Machine Learning

In this paper we prove that, in the deep limit, the stochastic gradient descent on a ResNet type deep neural network, where each layer share the same weight matrix, converges to the stochastic gradient descent for a Neural ODE and that the corresponding value/loss functions converge. Our result gives, in the context of minimization by stochastic gradient descent, a theoretical foundation for considering Neural ODEs as the deep limit of ResNets. Our proof is based on certain decay estimates for associated Fokker-Planck equations.


Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent

arXiv.org Machine Learning

With the increase in the amount of data and the expansion of model scale, distributed parallel training becomes an important and successful technique to address the optimization challenges. Nevertheless, although distributed stochastic gradient descent (SGD) algorithms can achieve a linear iteration speedup, they are limited significantly in practice by the communication cost, making it difficult to achieve a linear time speedup. In this paper, we propose a computation and communication decoupled stochastic gradient descent (CoCoD-SGD) algorithm to run computation and communication in parallel to reduce the communication cost. We prove that CoCoD-SGD has a linear iteration speedup with respect to the total computation capability of the hardware resources. In addition, it has a lower communication complexity and better time speedup comparing with traditional distributed SGD algorithms. Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 Geforce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3$\times$ faster than traditional synchronous SGD.


Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond

arXiv.org Machine Learning

In this paper, we provide near-optimal accelerated first-order methods for minimizing a broad class of smooth nonconvex functions that are strictly unimodal on all lines through a minimizer. This function class, which we call the class of smooth quasar-convex functions, is parameterized by a constant $\gamma \in (0,1]$, where $\gamma = 1$ encompasses the classes of smooth convex and star-convex functions, and smaller values of $\gamma$ indicate that the function can be "more nonconvex." We develop a variant of accelerated gradient descent that computes an $\epsilon$-approximate minimizer of a smooth $\gamma$-quasar-convex function with at most $O(\gamma^{-1} \epsilon^{-1/2} \log(\gamma^{-1} \epsilon^{-1}))$ total function and gradient evaluations. We also derive a lower bound of $\Omega(\gamma^{-1} \epsilon^{-1/2})$ on the number of gradient evaluations required by any deterministic first-order method in the worst case, showing that, up to a logarithmic factor, no deterministic first-order algorithm can improve upon ours.


Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Nets

arXiv.org Machine Learning

We introduce a family of complexity measures for the hypotheses of neural nets, based on a multilevel relative entropy. These complexity measures take into account the multilevel structure of neural nets, as opposed to the classical relative entropy (KL-divergence) term derived from PAC-Bayesian bounds [1] or mutual information bounds [2, 3]. We derive these complexity measures by combining the technique of chaining mutual information (CMI) [4], an algorithm-dependent extension of the classical chaining technique paired with the mutual information bound [2], with the multilevel architecture of neural nets. It is observed in this paper that if a neural net is regularized in a multilevel manner as defined in Section 4, then one can readily construct hierarchical coverings with controlled diameters for its hypothesis set, and exploit this to obtain new multi-scale and algorithm-dependent generalization bounds and, in turn, new regularizers and training algorithms. The effect of such multilevel regularizations on the representation ability of neural nets has also been recently studied in [5, 6] for the special case where layers are nearly-identity functions as for ResNets [7].


Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

arXiv.org Machine Learning

Large-batch stochastic gradient descent (SGD) is widely used for training in distributed deep learning because of its training-time efficiency, however, extremely large-batch SGD leads to poor generalization and easily converges to sharp minima, which prevents naive large-scale data-parallel SGD (DP-SGD) from converging to good minima. To overcome this difficulty, we propose gradient noise convolution (GNC), which effectively smooths sharper minima of the loss function. For DP-SGD, GNC utilizes so-called gradient noise, which is induced by stochastic gradient variation and convolved to the loss function as a smoothing effect. GNC computation can be performed by simply computing the stochastic gradient on each parallel worker and merging them, and is therefore extremely easy to implement. Due to convolving with the gradient noise, which tends to spread along a sharper direction of the loss function, GNC can effectively smooth sharp minima and achieve better generalization, whereas isotropic random noise cannot. We empirically show this effect by comparing GNC with isotropic random noise, and show that it achieves state-of-the-art generalization performance for large-scale deep neural network optimization.


Adaptive Learning Rate Clipping Stabilizes Learning

arXiv.org Machine Learning

Artificial neural network training with stochastic gradient descent can be destabilized by "bad batches" with high losses. This is often problematic for training with small batch sizes, high order loss functions or unstably high learning rates. To stabilize learning, we have developed adaptive learning rate clipping (ALRC) to limit backpropagated losses to a number of standard deviations above their running means. ALRC is designed to complement existing learning algorithms: Our algorithm is computationally inexpensive, can be applied to any loss function or batch size, is robust to hyperparameter choices and does not affect backpropagated gradient distributions. Experiments with CIFAR-10 supersampling show that ALCR decreases errors for unstable mean quartic error training while stable mean squared error training is unaffected. We also show that ALRC decreases unstable mean squared errors for partial scanning transmission electron micrograph completion. Our source code is publicly available at https://github.com/Jeffrey-Ede/ALRC


The Value of Collaboration in Convex Machine Learning with Differential Privacy

arXiv.org Machine Learning

In this paper, we apply machine learning to distributed private data owned by multiple data owners, entities with access to non-overlapping training datasets. We use noisy, differentially-private gradients to minimize the fitness cost of the machine learning model using stochastic gradient descent. We quantify the quality of the trained model, using the fitness cost, as a function of privacy budget and size of the distributed datasets to capture the trade-off between privacy and utility in machine learning. This way, we can predict the outcome of collaboration among privacy-aware data owners prior to executing potentially computationally-expensive machine learning algorithms. Particularly, we show that the difference between the fitness of the trained machine learning model using differentially-private gradient queries and the fitness of the trained machine model in the absence of any privacy concerns is inversely proportional to the size of the training datasets squared and the privacy budget squared. We successfully validate the performance prediction with the actual performance of the proposed privacy-aware learning algorithms, applied to: financial datasets for determining interest rates of loans using regression; and detecting credit card frauds using support vector machines.


True Gradient Descent

#artificialintelligence

In machine learning, the effectiveness of a network is measured by an error function which measures how good a network is at its job. In general, the higher the error, the worse the network is. Conversely, the lower the error, the better the network is. When a machine is trying to learn how to do a task, it tries to make as few mistakes as possible; that is, minimize its error function. True gradient descent is the application of gradient descent to a machine learning network to minimize an error function.


Beneficial perturbation network for continual learning

arXiv.org Artificial Intelligence

Sequential learning of multiple tasks in artificial neural networks using gradient descent leads to catastrophic forgetting, whereby previously learned knowledge is erased during learning of new, disjoint knowledge. Here, we propose a fundamentally new type of method - Beneficial Perturbation Network (BPN). We add task-dependent memory (biasing) units to allow the network to operate in different regimes for different tasks. We compute the most beneficial directions to train these units, in a manner inspired by recent work on adversarial examples. At test time, beneficial perturbations for a given task bias the network toward that task to overcome catastrophic forgetting. BPN is not only more parameter-efficient than network expansion methods, but also does not need to store any data from previous tasks, in contrast with episodic memory methods. Experiments on variants of the MNIST, CIFAR-10, CIFAR-100 datasets demonstrate strong performance of BPN when compared to the state-of-the-art.