Revisit last-iterate convergence of mSGD under milder requirement on step size

Neural Information Processing Systems

Understanding convergence of SGD-based optimization algorithms can help deal with enormous machine learning problems. To ensure last-iterate convergence of SGD and momentum-based SGD (mSGD), the existing studies usually constrain the step size \epsilon_{n} to decay as \sum_{n=1}^{\infty}\epsilon_{n}^{2} < \infty, which however is rather conservative and may lead to slow convergence in the early stage of the iteration. In this paper, we relax this requirement by studying an alternate step size for the mSGD. This implies that a larger step size, such as \epsilon_{n} = \frac{1}{\sqrt{n}}, can be utilized for accelerating the mSGD in the early stage. Under this new step size and some common conditions, we prove that the gradient norm of mSGD for non-convex loss functions asymptotically decays to zero.
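The step-size schedule above can be sketched on a toy problem. The snippet below is a minimal, illustrative implementation of momentum SGD with \epsilon_n = 1/\sqrt{n} on a 1-D quadratic f(w) = w^2/2 (so the gradient is w); the momentum coefficient and step count are assumed values, not taken from the paper.

```python
import math

def msgd(grad, w0, beta=0.9, n_steps=2000):
    """Momentum SGD with the larger step size eps_n = 1/sqrt(n)."""
    w, m = w0, 0.0
    for n in range(1, n_steps + 1):
        eps_n = 1.0 / math.sqrt(n)           # relaxed (non square-summable) step size
        m = beta * m + (1 - beta) * grad(w)  # exponential moving average of gradients
        w = w - eps_n * m                    # parameter update
    return w

# On f(w) = w^2/2 the gradient norm is |w|; it decays toward zero.
w_final = msgd(lambda w: w, w0=5.0)
```

Note that \sum_n 1/n diverges while the schedule still decays, which is what allows larger early steps than a square-summable schedule would.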


Learning from Streaming Data when Users Choose

Su, Jinyan, Dean, Sarah

arXiv.org Artificial Intelligence

In digital markets comprised of many competing services, each user chooses between multiple service providers according to their preferences, and the chosen service makes use of the user data to incrementally improve its model. The service providers' models influence which service the user will choose at the next time step, and the user's choice, in return, influences the model update. Moreover, due to the data-driven nature of digital platforms, interesting dynamics emerge among users and service providers: on the one hand, users choose amongst providers based on the quality of their services; on the other hand, providers use the user data to improve and update their services, affecting future user choices (Ginart et al., 2021; Kwon et al., 2022; Dean et al., 2024; Jagadeesan et al., 2023a). For example, in personalized music streaming, a user chooses amongst different music streaming platforms based on how well they meet the user's needs.


Revisiting Outer Optimization in Adversarial Training

Dabouei, Ali, Taherkhani, Fariborz, Soleymani, Sobhan, Nasrabadi, Nasser M.

arXiv.org Artificial Intelligence

Despite the fundamental distinction between adversarial and natural training (AT and NT), AT methods generally adopt momentum SGD (MSGD) for the outer optimization. This paper aims to analyze this choice by investigating the overlooked role of outer optimization in AT. Our exploratory evaluations reveal that AT induces higher gradient norm and variance compared to NT. This phenomenon hinders the outer optimization in AT since the convergence rate of MSGD is highly dependent on the variance of the gradients. To this end, we propose an optimization method called ENGM which regularizes the contribution of each input example to the average mini-batch gradients. We prove that the convergence rate of ENGM is independent of the variance of the gradients, and thus, it is suitable for AT. We introduce a trick to reduce the computational cost of ENGM using empirical observations on the correlation between the norm of gradients w.r.t. the network parameters and input examples. Our extensive evaluations and ablation studies on CIFAR-10, CIFAR-100, and TinyImageNet demonstrate that ENGM and its variants consistently improve the performance of a wide range of AT methods.
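The abstract does not spell out ENGM's update rule; one common way to "regularize the contribution of each input example to the average mini-batch gradients" is to clip each per-example gradient to a maximum norm before averaging. The sketch below uses that stand-in as an assumption and should not be read as the paper's exact method.

```python
import numpy as np

def clipped_mean_gradient(per_example_grads, max_norm=1.0):
    """Average per-example gradients after capping each one's norm.

    Illustrative stand-in for bounding individual examples' influence;
    max_norm is a hypothetical hyperparameter.
    """
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
        clipped.append(g * scale)
    return np.mean(clipped, axis=0)

grads = [np.array([3.0, 4.0]),   # norm 5 -> rescaled to norm 1
         np.array([0.3, 0.4])]   # norm 0.5 -> left unchanged
avg = clipped_mean_gradient(grads)
```

Because every clipped gradient has norm at most `max_norm`, the averaged gradient does too, which caps the variance contribution of any single (e.g. adversarial) example.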


Stochastic Normalized Gradient Descent with Momentum for Large Batch Training

Zhao, Shen-Yi, Xie, Yin-Peng, Li, Wu-Jun

arXiv.org Machine Learning

Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning. Compared with small batch training, SGD with large batch training can better utilize the computational power of current multi-core systems like GPUs and can reduce the number of communication rounds in distributed training. Hence, SGD with large batch training has attracted more and more attention. However, existing empirical results show that large batch training typically leads to a drop of generalization accuracy. As a result, large batch training has also become a challenging topic. In this paper, we propose a novel method, called stochastic normalized gradient descent with momentum (SNGM), for large batch training. We theoretically prove that compared to momentum SGD (MSGD) which is one of the most widely used variants of SGD, SNGM can adopt a larger batch size to converge to the $\epsilon$-stationary point with the same computation complexity (total number of gradient computation). Empirical results on deep learning also show that SNGM can achieve the state-of-the-art accuracy with a large batch size.
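A minimal sketch of the normalized-gradient-plus-momentum idea, on a toy 1-D quadratic: the stochastic gradient is rescaled to unit norm before entering the momentum buffer. The hyperparameters and the exact placement of the normalization are illustrative assumptions; the paper's precise formulation may differ.

```python
def sngm_step(w, m, grad, lr=0.01, beta=0.9):
    """One step of (sketched) stochastic normalized gradient descent with momentum."""
    g = grad(w)
    norm = abs(g)
    g_hat = g / norm if norm > 0 else 0.0  # unit-norm gradient direction
    m = beta * m + g_hat                   # momentum on the normalized gradient
    return w - lr * m, m

# Toy run on f(w) = w^2/2, where grad f(w) = w.
w, m = 2.0, 0.0
for _ in range(500):
    w, m = sngm_step(w, m, lambda w: w)
```

Since the per-step update magnitude is bounded by lr/(1-beta) regardless of how large or noisy the raw gradient is, the step size is insulated from gradient variance, which is the property that lets the batch size grow.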


Adam Induces Implicit Weight Sparsity in Rectifier Neural Networks

Yaguchi, Atsushi, Suzuki, Taiji, Asano, Wataru, Nitta, Shuhei, Sakata, Yukinobu, Tanizawa, Akiyuki

arXiv.org Machine Learning

In recent years, deep neural networks (DNNs) have been applied to various machine learning tasks, including image recognition, speech recognition, and machine translation. However, large DNN models are needed to achieve state-of-the-art performance, exceeding the capabilities of edge devices. Model reduction is thus needed for practical use. In this paper, we point out that deep learning automatically induces group sparsity of weights, in which all weights connected to an output channel (node) are zero, when training DNNs under the following three conditions: (1) rectified-linear-unit (ReLU) activations, (2) an $L_2$-regularized objective function, and (3) the Adam optimizer. Next, we analyze this behavior both theoretically and experimentally, and propose a simple model reduction method: eliminate the zero weights after training the DNN. In experiments on MNIST and CIFAR-10 datasets, we demonstrate the sparsity with various training setups. Finally, we show that our method can efficiently reduce the model size and performs well relative to methods that use a sparsity-inducing regularizer.
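The model-reduction step described above, eliminating the zero weights after training, can be sketched as dropping the output channels (here, rows of a layer's weight matrix) whose weights are all numerically zero. The tolerance and the rows-as-channels layout are illustrative assumptions.

```python
import numpy as np

def prune_zero_channels(W, b, tol=1e-8):
    """Drop output channels whose weights are all (numerically) zero.

    W: (out_channels, in_features) weight matrix; b: (out_channels,) bias.
    Returns the pruned weights/bias and the boolean keep-mask.
    """
    keep = np.any(np.abs(W) > tol, axis=1)  # rows with at least one nonzero weight
    return W[keep], b[keep], keep

W = np.array([[0.5, -0.2],
              [0.0,  0.0],   # dead channel: all weights zero
              [1.0,  0.3]])
b = np.array([0.1, 0.0, -0.4])
W2, b2, keep = prune_zero_channels(W, b)
```

In a multi-layer network the next layer's corresponding input columns must be removed with the same mask, so that layer shapes stay consistent after pruning.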