Goto

Collaborating Authors

 Gradient Descent


Intermittent Pulling with Local Compensation for Communication-Efficient Federated Learning

arXiv.org Machine Learning

Federated Learning is a powerful machine learning paradigm to cooperatively train a global model with highly distributed data. A major bottleneck on the performance of distributed Stochastic Gradient Descent (SGD) algorithm for large-scale Federated Learning is the communication overhead on pushing local gradients and pulling global model. In this paper, to reduce the communication complexity of Federated Learning, a novel approach named Pulling Reduction with Local Compensation (PRLC) is proposed. Specifically, each training node intermittently pulls the global model from the server in SGD iterations, resulting in that it is sometimes unsynchronized with the server. In such a case, it will use its local update to compensate the gap between the local model and the global model. Our rigorous theoretical analysis of PRLC achieves two important findings. First, we prove that the convergence rate of PRLC preserves the same order as the classical synchronous SGD for both strongly-convex and non-convex cases with good scalability due to the linear speedup with respect to the number of training nodes. Second, we show that PRLC admits lower pulling frequency than the existing pulling reduction method without local compensation. We also conduct extensive experiments on various machine learning models to validate our theoretical results. Experimental results show that our approach achieves a significant pulling reduction over the state-of-the-art methods, e.g., PRLC requiring only half of the pulling operations of LAG.


On Last-Layer Algorithms for Classification: Decoupling Representation from Uncertainty Estimation

arXiv.org Machine Learning

Uncertainty quantification for deep learning is a challenging open problem. Bayesian statistics offer a mathematically grounded framework to reason about uncertainties; however, approximate posteriors for modern neural networks still require prohibitive computational costs. We propose a family of algorithms which split the classification task into two stages: representation learning and uncertainty estimation. We compare four specific instances, where uncertainty estimation is performed via either an ensemble of Stochastic Gradient Descent or Stochastic Gradient Langevin Dynamics snapshots, an ensemble of bootstrapped logistic regressions, or via a number of Monte Carlo Dropout passes. We evaluate their performance in terms of \emph{selective} classification (risk-coverage), and their ability to detect out-of-distribution samples. Our experiments suggest there is limited value in adding multiple uncertainty layers to deep classifiers, and we observe that these simple methods strongly outperform a vanilla point-estimate SGD in some complex benchmarks like ImageNet.


Keyword-based Topic Modeling and Keyword Selection

arXiv.org Machine Learning

Certain type of documents such as tweets are collected by specifying a set of keywords. As topics of interest change with time it is beneficial to adjust keywords dynamically. The challenge is that these need to be specified ahead of knowing the forthcoming documents and the underlying topics. The future topics should mimic past topics of interest yet there should be some novelty in them. We develop a keyword-based topic model that dynamically selects a subset of keywords to be used to collect future documents. The generative process first selects keywords and then the underlying documents based on the specified keywords. The model is trained by using a variational lower bound and stochastic gradient optimization. The inference consists of finding a subset of keywords where given a subset the model predicts the underlying topic-word matrix for the unknown forthcoming documents. We compare the keyword topic model against a benchmark model using viral predictions of tweets combined with a topic model. The keyword-based topic model outperforms this sophisticated baseline model by 67%.


Zeroth-Order Algorithms for Nonconvex Minimax Problems with Improved Complexities

arXiv.org Machine Learning

In this paper, we study zeroth-order algorithms for minimax optimization problems that are nonconvex in one variable and strongly-concave in the other variable. Such minimax optimization problems have attracted significant attention lately due to their applications in modern machine learning tasks. We first design and analyze the Zeroth-Order Gradient Descent Ascent (\texttt{ZO-GDA}) algorithm, and provide improved results compared to existing works, in terms of oracle complexity. Next, we propose the Zeroth-Order Gradient Descent Multi-Step Ascent (\texttt{ZO-GDMSA}) algorithm that significantly improves the oracle complexity of \texttt{ZO-GDA}. We also provide stochastic version of \texttt{ZO-GDA} and \texttt{ZO-GDMSA} to handle stochastic nonconvex minimax problems, and provide oracle complexity results.


Understanding Why Neural Networks Generalize Well Through GSNR of Parameters

arXiv.org Machine Learning

GSNR of a parameter is defined as the ratio between its gradient's squared mean and Previous work (Zhang et al., 2016; Hardt et al., 2015; Dziugaite & Roy, 2017) suggests that the The GSNR of a parameter is defined as the ratio between its gradient's squared mean and variance Previous work tried to use GSNR to conduct theoretical analysis on deep learning. For example, Rainforth et al. (2018) used GSNR to analyze variational bounds in Intuitively, GSNR measures the similarity of a parameter's gradients among different training samples. To reveal the mechanism of DNNs' good generalization ability, we show that the gradient descent We believe this is probably the key to DNNs' remarkable generalization ability. In the remainder of this paper we first analyze the relation between GSNR and generalization (Section 2). At a particular point of the parameter space, GSNR measures the consistency of a parameter's gradients across different data samples.


SGLB: Stochastic Gradient Langevin Boosting

arXiv.org Machine Learning

In this paper, we introduce Stochastic Gradient Langevin Boosting (SGLB) -- a powerful and efficient machine learning framework, which may deal with a wide range of loss functions and has provable generalization guarantees. The method is based on a special form of Langevin Diffusion equation specifically designed for gradient boosting. This allows us to guarantee the global convergence, while standard gradient boosting algorithms can guarantee only local optima, which is a problem for multimodal loss functions. To illustrate the advantages of SGLB, we apply it to a classification task with 0-1 loss function, which is known to be multimodal, and to a standard Logistic regression task that is convex. The algorithm is implemented as a part of the CatBoost gradient boosting library and outperforms classic gradient boosting methods.


Weakly Supervised Learning Meets Ride-Sharing User Experience Enhancement

arXiv.org Machine Learning

Weakly supervised learning aims at coping with scarce labeled data. Previous weakly supervised studies typically assume that there is only one kind of weak supervision in data. In many applications, however, raw data usually contains more than one kind of weak supervision at the same time. For example, in user experience enhancement from Didi, one of the largest online ride-sharing platforms, the ride comment data contains severe label noise (due to the subjective factors of passengers) and severe label distribution bias (due to the sampling bias). We call such a problem as "compound weakly supervised learning". In this paper, we propose the CWSL method to address this problem based on Didi ride-sharing comment data. Specifically, an instance reweighting strategy is employed to cope with severe label noise in comment data, where the weights for harmful noisy instances are small. Robust criteria like AUC rather than accuracy and the validation performance are optimized for the correction of biased data label. Alternating optimization and stochastic gradient methods accelerate the optimization on large-scale data. Experiments on Didi ride-sharing comment data clearly validate the effectiveness. We hope this work may shed some light on applying weakly supervised learning to complex real situations.


Dual Stochastic Natural Gradient Descent

arXiv.org Machine Learning

Although theoretically appealing, Stochastic Natural Gradient Des cent (SNGD) [1] is computationally expensive, it has been shown to be highly sensitiv e to the learning rate, and it is not guaranteed to be convergent. Converg ent Stochastic Natural Gradient Descent (CSNGD) [6] aims at solving the last two pr oblems. However, the computational expense of CSNGD is still unacceptab le when the number of parameters is large. In this paper we introduce the Dual Stochastic Natural Gradient Descent (DSNGD) where we take benefit of dually flat manifolds to obtain a robust alternative to SNGD which is also computation ally feasible. We start by reviewing dually flat manifold concepts in section 3. Then w e introduce exponential XY families, the mathematical model required for the application of DSNGD, in section 4. After that, in section 5 we introduce DSNGD in exponential XY families under a minimal parameterization. The same idea can be extended to exponential XY families which are overparameterized.


Adaptive Stochastic Optimization

arXiv.org Machine Learning

Optimization lies at the heart of machine learning and signal processing. Contemporary approaches based on the stochastic gradient method are non-adaptive in the sense that their implementation employs prescribed parameter values that need to be tuned for each application. This article summarizes recent research and motivates future work on adaptive stochastic optimization methods, which have the potential to offer significant computational savings when training large-scale systems.


Privacy Amplification of Iterative Algorithms via Contraction Coefficients

arXiv.org Machine Learning

We investigate the framework of privacy amplification by iteration, recently proposed by Feldman et al., from an information-theoretic lens. We demonstrate that differential privacy guarantees of iterative mappings can be determined by a direct application of contraction coefficients derived from strong data processing inequalities for f -divergences. In particular, by generalizing the Dobrushin's contraction coefficient for total variation distance to an f -divergence known as E Differential privacy (DP) [1, 2] has become the standard definition for designing privacy-preserving machine learning algorithms. One reason for its success is its operational significance, which can be best described in terms of binary hypothesis testing (see, e.g., [3, 4]). Nevertheless, it is often difficult to compute DP guarantees for applications where a high number of data accesses is needed for a single analysis [5, 6]. To obtain the DP parameters in such applications, which include machine learning models trained using stochastic gradient descent (SGD), one needs to resort to composition theorems which are often loose due to their generality. As a remedy, several variants of DP have been recently proposed [7-10] based on Rényi divergence. These variants enjoy better composition properties.