AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM

Wang, Bao, Gu, Quanquan, Boedihardjo, March, Barekat, Farzin, Osher, Stanley J.

arXiv.org Machine LearningJun-28-2019

Machine learning (ML) models trained by differentially private stochastic gradient descent (DP-SGD) has much lower utility than the non-private ones. To mitigate this degradation, we propose a DP Laplacian smoothing SGD (DP-LSSGD) for privacy-preserving ML. At the core of DP-LSSGD is the Laplace smoothing operator, which smooths out the Gaussian noise vector used in the Gaussian mechanism. Under the same amount of noise used in the Gaussian mechanism, DP-LSSGD attains the same differential privacy guarantee, but a strictly better utility guarantee, excluding an intrinsic term which is usually dominated by the other terms, for convex optimization than DP-SGD by a factor which is much less than one. In practice, DP-LSSGD makes training both convex and nonconvex ML models more efficient and enables the trained models to generalize better. For ResNet20, under the same strong differential privacy guarantee, DP-LSSGD can lift the testing accuracy of the trained private model by more than $8$\% compared with DP-SGD. The proposed algorithm is simple to implement and the extra computational complexity and memory overhead compared with DP-SGD are negligible. DP-LSSGD is applicable to train a large variety of ML models, including deep neural nets. The code is available at \url{https://github.com/BaoWangMath/DP-LSSGD}.

artificial intelligence, dp-lssgd, machine learning, (15 more...)

arXiv.org Machine Learning

1906.12056

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.15)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.87)

Industry:

Information Technology > Security & Privacy (1.00)
Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Neural ODEs as the Deep Limit of ResNets with constant weights

Avelin, Benny, Nyström, Kaj

arXiv.org Machine LearningJun-28-2019

In this paper we prove that, in the deep limit, the stochastic gradient descent on a ResNet type deep neural network, where each layer share the same weight matrix, converges to the stochastic gradient descent for a Neural ODE and that the corresponding value/loss functions converge. Our result gives, in the context of minimization by stochastic gradient descent, a theoretical foundation for considering Neural ODEs as the deep limit of ResNets. Our proof is based on certain decay estimates for associated Fokker-Planck equations.

artificial intelligence, machine learning, theorem 2, (16 more...)

arXiv.org Machine Learning

1906.12183

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Sweden > Uppsala County > Uppsala (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
(3 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent

Shen, Shuheng, Xu, Linli, Liu, Jingchang, Liang, Xianfeng, Cheng, Yifei

arXiv.org Machine LearningJun-28-2019

With the increase in the amount of data and the expansion of model scale, distributed parallel training becomes an important and successful technique to address the optimization challenges. Nevertheless, although distributed stochastic gradient descent (SGD) algorithms can achieve a linear iteration speedup, they are limited significantly in practice by the communication cost, making it difficult to achieve a linear time speedup. In this paper, we propose a computation and communication decoupled stochastic gradient descent (CoCoD-SGD) algorithm to run computation and communication in parallel to reduce the communication cost. We prove that CoCoD-SGD has a linear iteration speedup with respect to the total computation capability of the hardware resources. In addition, it has a lower communication complexity and better time speedup comparing with traditional distributed SGD algorithms. Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 Geforce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3$\times$ faster than traditional synchronous SGD.

artificial intelligence, communication, machine learning, (16 more...)

arXiv.org Machine Learning

1906.12043

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond

Hinder, Oliver, Sidford, Aaron, Sohoni, Nimit Sharad

arXiv.org Machine LearningJun-27-2019

In this paper, we provide near-optimal accelerated first-order methods for minimizing a broad class of smooth nonconvex functions that are strictly unimodal on all lines through a minimizer. This function class, which we call the class of smooth quasar-convex functions, is parameterized by a constant $\gamma \in (0,1]$, where $\gamma = 1$ encompasses the classes of smooth convex and star-convex functions, and smaller values of $\gamma$ indicate that the function can be "more nonconvex." We develop a variant of accelerated gradient descent that computes an $\epsilon$-approximate minimizer of a smooth $\gamma$-quasar-convex function with at most $O(\gamma^{-1} \epsilon^{-1/2} \log(\gamma^{-1} \epsilon^{-1}))$ total function and gradient evaluations. We also derive a lower bound of $\Omega(\gamma^{-1} \epsilon^{-1/2})$ on the number of gradient evaluations required by any deterministic first-order method in the worst case, showing that, up to a logarithmic factor, no deterministic first-order algorithm can improve upon ours.

artificial intelligence, gradient evaluation, machine learning, (16 more...)

arXiv.org Machine Learning

1906.11985

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Nets

Asadi, Amir R., Abbe, Emmanuel

arXiv.org Machine LearningJun-26-2019

We introduce a family of complexity measures for the hypotheses of neural nets, based on a multilevel relative entropy. These complexity measures take into account the multilevel structure of neural nets, as opposed to the classical relative entropy (KL-divergence) term derived from PAC-Bayesian bounds [1] or mutual information bounds [2, 3]. We derive these complexity measures by combining the technique of chaining mutual information (CMI) [4], an algorithm-dependent extension of the classical chaining technique paired with the mutual information bound [2], with the multilevel architecture of neural nets. It is observed in this paper that if a neural net is regularized in a multilevel manner as defined in Section 4, then one can readily construct hierarchical coverings with controlled diameters for its hypothesis set, and exploit this to obtain new multi-scale and algorithm-dependent generalization bounds and, in turn, new regularizers and training algorithms. The effect of such multilevel regularizations on the representation ability of neural nets has also been recently studied in [5, 6] for the special case where layers are nearly-identity functions as for ResNets [7].

artificial intelligence, machine learning, relative entropy, (14 more...)

arXiv.org Machine Learning

1906.11148

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Add feedback

Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

Haruki, Kosuke, Suzuki, Taiji, Hamakawa, Yohei, Toda, Takeshi, Sakai, Ryuji, Ozawa, Masahiro, Kimura, Mitsuhiro

arXiv.org Machine LearningJun-25-2019

Large-batch stochastic gradient descent (SGD) is widely used for training in distributed deep learning because of its training-time efficiency, however, extremely large-batch SGD leads to poor generalization and easily converges to sharp minima, which prevents naive large-scale data-parallel SGD (DP-SGD) from converging to good minima. To overcome this difficulty, we propose gradient noise convolution (GNC), which effectively smooths sharper minima of the loss function. For DP-SGD, GNC utilizes so-called gradient noise, which is induced by stochastic gradient variation and convolved to the loss function as a smoothing effect. GNC computation can be performed by simply computing the stochastic gradient on each parallel worker and merging them, and is therefore extremely easy to implement. Due to convolving with the gradient noise, which tends to spread along a sharper direction of the loss function, GNC can effectively smooth sharp minima and achieve better generalization, whereas isotropic random noise cannot. We empirically show this effect by comparing GNC with isotropic random noise, and show that it achieves state-of-the-art generalization performance for large-scale deep neural network optimization.

artificial intelligence, deep learning, machine learning, (19 more...)

arXiv.org Machine Learning

1906.10822

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.96)

Add feedback

Adaptive Learning Rate Clipping Stabilizes Learning

Ede, Jeffrey M., Beanland, Richard

arXiv.org Machine LearningJun-24-2019

Artificial neural network training with stochastic gradient descent can be destabilized by "bad batches" with high losses. This is often problematic for training with small batch sizes, high order loss functions or unstably high learning rates. To stabilize learning, we have developed adaptive learning rate clipping (ALRC) to limit backpropagated losses to a number of standard deviations above their running means. ALRC is designed to complement existing learning algorithms: Our algorithm is computationally inexpensive, can be applied to any loss function or batch size, is robust to hyperparameter choices and does not affect backpropagated gradient distributions. Experiments with CIFAR-10 supersampling show that ALCR decreases errors for unstable mean quartic error training while stable mean squared error training is unaffected. We also show that ALRC decreases unstable mean squared errors for partial scanning transmission electron micrograph completion. Our source code is publicly available at https://github.com/Jeffrey-Ede/ALRC

alrc, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

1906.0906

Country:

Europe > United Kingdom (0.05)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

The Value of Collaboration in Convex Machine Learning with Differential Privacy

Wu, Nan, Farokhi, Farhad, Smith, David, Kaafar, Mohamed Ali

arXiv.org Machine LearningJun-23-2019

In this paper, we apply machine learning to distributed private data owned by multiple data owners, entities with access to non-overlapping training datasets. We use noisy, differentially-private gradients to minimize the fitness cost of the machine learning model using stochastic gradient descent. We quantify the quality of the trained model, using the fitness cost, as a function of privacy budget and size of the distributed datasets to capture the trade-off between privacy and utility in machine learning. This way, we can predict the outcome of collaboration among privacy-aware data owners prior to executing potentially computationally-expensive machine learning algorithms. Particularly, we show that the difference between the fitness of the trained machine learning model using differentially-private gradient queries and the fitness of the trained machine model in the absence of any privacy concerns is inversely proportional to the size of the training datasets squared and the privacy budget squared. We successfully validate the performance prediction with the actual performance of the proposed privacy-aware learning algorithms, applied to: financial datasets for determining interest rates of loans using regression; and detecting credit card frauds using support vector machines.

artificial intelligence, dataset, machine learning, (15 more...)

arXiv.org Machine Learning

1906.09679

Country:

Oceania > Australia (0.14)
North America > United States > California (0.04)

Genre: Research Report (0.64)

Industry:

Information Technology > Security & Privacy (1.00)
Banking & Finance (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)

Add feedback

True Gradient Descent

#artificialintelligenceJun-22-2019, 18:22:29 GMT

In machine learning, the effectiveness of a network is measured by an error function which measures how good a network is at its job. In general, the higher the error, the worse the network is. Conversely, the lower the error, the better the network is. When a machine is trying to learn how to do a task, it tries to make as few mistakes as possible; that is, minimize its error function. True gradient descent is the application of gradient descent to a machine learning network to minimize an error function.

artificial intelligence, error function, machine learning, (2 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)

Add feedback

Beneficial perturbation network for continual learning

Wen, Shixian, Itti, Laurent

arXiv.org Artificial IntelligenceJun-22-2019

Sequential learning of multiple tasks in artificial neural networks using gradient descent leads to catastrophic forgetting, whereby previously learned knowledge is erased during learning of new, disjoint knowledge. Here, we propose a fundamentally new type of method - Beneficial Perturbation Network (BPN). We add task-dependent memory (biasing) units to allow the network to operate in different regimes for different tasks. We compute the most beneficial directions to train these units, in a manner inspired by recent work on adversarial examples. At test time, beneficial perturbations for a given task bias the network toward that task to overcome catastrophic forgetting. BPN is not only more parameter-efficient than network expansion methods, but also does not need to store any data from previous tasks, in contrast with episodic memory methods. Experiments on variants of the MNIST, CIFAR-10, CIFAR-100 datasets demonstrate strong performance of BPN when compared to the state-of-the-art.

artificial intelligence, machine learning, memory unit, (17 more...)

arXiv.org Artificial Intelligence

1906.10528

Country: North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback