Koloskova, Anastasia
The Privacy Power of Correlated Noise in Decentralized Learning
Allouah, Youssef, Koloskova, Anastasia, Firdoussi, Aymane El, Jaggi, Martin, Guerraoui, Rachid
Decentralized learning is appealing as it enables the scalable usage of large amounts of distributed data and resources (without resorting to any central entity), while promoting privacy since every user minimizes the direct exposure of their data. Yet, without additional precautions, curious users can still leverage models obtained from their peers to violate privacy. In this paper, we propose Decor, a variant of decentralized SGD with differential privacy (DP) guarantees. Essentially, in Decor, users securely exchange randomness seeds in one communication round to generate pairwise-canceling correlated Gaussian noises, which are injected to protect local models at every communication round. We theoretically and empirically show that, for arbitrary connected graphs, Decor matches the central DP optimal privacy-utility trade-off. We do so under SecLDP, our new relaxation of local DP, which protects all user communications against an external eavesdropper and curious users, assuming that every pair of connected users shares a secret, i.e., an information hidden to all others. The main theoretical challenge is to control the accumulation of non-canceling correlated noise due to network sparsity. We also propose a companion SecLDP privacy accountant for public use.
Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees
Koloskova, Anastasia, Hendrikx, Hadrien, Stich, Sebastian U.
Gradient clipping is a popular modification to standard (stochastic) gradient descent, at every iteration limiting the gradient norm to a certain value $c >0$. It is widely used for example for stabilizing the training of deep learning models (Goodfellow et al., 2016), or for enforcing differential privacy (Abadi et al., 2016). Despite popularity and simplicity of the clipping mechanism, its convergence guarantees often require specific values of $c$ and strong noise assumptions. In this paper, we give convergence guarantees that show precise dependence on arbitrary clipping thresholds $c$ and show that our guarantees are tight with both deterministic and stochastic gradients. In particular, we show that (i) for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence, (ii) in the stochastic setting convergence to the true optimum cannot be guaranteed under the standard noise assumption, even under arbitrary small step-sizes. We give matching upper and lower bounds for convergence of the gradient norm when running clipped SGD, and illustrate these results with experiments.
Shuffle SGD is Always Better than SGD: Improved Analysis of SGD with Arbitrary Data Orders
Koloskova, Anastasia, Doikov, Nikita, Stich, Sebastian U., Jaggi, Martin
Stochastic Gradient Descent (SGD) algorithms are widely used in optimizing neural networks, with Random Reshuffling (RR) and Single Shuffle (SS) being popular choices for cycling through random or single permutations of the training data. However, the convergence properties of these algorithms in the non-convex case are not fully understood. Existing results suggest that, in realistic training scenarios where the number of epochs is smaller than the training set size, RR may perform worse than SGD. In this paper, we analyze a general SGD algorithm that allows for arbitrary data orderings and show improved convergence rates for non-convex functions. Specifically, our analysis reveals that SGD with random and single shuffling is always faster or at least as good as classical SGD with replacement, regardless of the number of iterations. Overall, our study highlights the benefits of using SGD with random/single shuffling and provides new insights into its convergence properties for non-convex optimization.
Gradient Descent with Linearly Correlated Noise: Theory and Applications to Differential Privacy
Koloskova, Anastasia, McKenna, Ryan, Charles, Zachary, Rush, Keith, McMahan, Brendan
We study gradient descent under linearly correlated noise. Our work is motivated by recent practical methods for optimization with differential privacy (DP), such as DP-FTRL, which achieve strong performance in settings where privacy amplification techniques are infeasible (such as in federated learning). These methods inject privacy noise through a matrix factorization mechanism, making the noise linearly correlated over iterations. We propose a simplified setting that distills key facets of these methods and isolates the impact of linearly correlated noise. We analyze the behavior of gradient descent in this setting, for both convex and non-convex functions. Our analysis is demonstrably tighter than prior work and recovers multiple important special cases exactly (including anticorrelated perturbed gradient descent). We use our results to develop new, effective matrix factorizations for differentially private optimization, and highlight the benefits of these factorizations theoretically and empirically.
Decentralized Local Stochastic Extra-Gradient for Variational Inequalities
Beznosikov, Aleksandr, Dvurechensky, Pavel, Koloskova, Anastasia, Samokhin, Valentin, Stich, Sebastian U, Gasnikov, Alexander
We consider distributed stochastic variational inequalities (VIs) on unbounded domains with the problem data that is heterogeneous (non-IID) and distributed across many devices. We make a very general assumption on the computational network that, in particular, covers the settings of fully decentralized calculations with time-varying networks and centralized topologies commonly used in Federated Learning. Moreover, multiple local updates on the workers can be made for reducing the communication frequency between the workers. We extend the stochastic extragradient method to this very general setting and theoretically analyze its convergence rate in the strongly-monotone, monotone, and non-monotone (when a Minty solution exists) settings. The provided rates explicitly exhibit the dependence on network characteristics (e.g., mixing time), iteration counter, data heterogeneity, variance, number of devices, and other standard parameters. As a special case, our method and analysis apply to distributed stochastic saddle-point problems (SPP), e.g., to the training of Deep Generative Adversarial Networks (GANs) for which decentralized training has been reported to be extremely challenging. In experiments for the decentralized training of GANs we demonstrate the effectiveness of our proposed approach.
Decentralized Gradient Tracking with Local Steps
Liu, Yue, Lin, Tao, Koloskova, Anastasia, Stich, Sebastian U.
Gradient tracking (GT) is an algorithm designed for solving decentralized optimization problems over a network (such as training a machine learning model). A key feature of GT is a tracking mechanism that allows to overcome data heterogeneity between nodes. We develop a novel decentralized tracking mechanism, $K$-GT, that enables communication-efficient local updates in GT while inheriting the data-independence property of GT. We prove a convergence rate for $K$-GT on smooth non-convex functions and prove that it reduces the communication overhead asymptotically by a linear factor $K$, where $K$ denotes the number of local steps. We illustrate the robustness and effectiveness of this heterogeneity correction on convex and non-convex benchmark problems and on a non-convex neural network training task with the MNIST dataset.
RelaySum for Decentralized Deep Learning on Heterogeneous Data
Vogels, Thijs, He, Lie, Koloskova, Anastasia, Lin, Tao, Karimireddy, Sai Praneeth, Stich, Sebastian U., Jaggi, Martin
In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with few neighbors without central coordination, these updates propagate progressively over the network. This paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distributed training in data centers. A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions. To tackle this challenge, we introduce the RelaySum mechanism for information propagation in decentralized learning. RelaySum uses spanning trees to distribute information exactly uniformly across all workers with finite delays depending on the distance between nodes. In contrast, the typical gossip averaging mechanism only distributes data uniformly asymptotically while using the same communication volume per step as RelaySum. We prove that RelaySGD, based on this mechanism, is independent of data heterogeneity and scales to many workers, enabling highly accurate decentralized deep learning on heterogeneous data. Our code is available at http://github.com/epfml/relaysgd.
Decentralized Deep Learning with Arbitrary Communication Compression
Koloskova, Anastasia, Lin, Tao, Stich, Sebastian U., Jaggi, Martin
Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters. As current approaches suffer from limited bandwidth of the network, we propose the use of communication compression in the decentralized training context. We show that Choco-SGD $-$ recently introduced and analyzed for strongly-convex objectives only $-$ converges under arbitrary high compression ratio on general non-convex functions at the rate $O\bigl(1/\sqrt{nT}\bigr)$ where $T$ denotes the number of iterations and $n$ the number of workers. The algorithm achieves linear speedup in the number of workers and supports higher compression than previous state-of-the art methods. We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over distributed user devices, connected by a social network and (ii) in a datacenter (outperforming all-reduce time-wise).
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
Koloskova, Anastasia, Stich, Sebastian U., Jaggi, Martin
We consider decentralized stochastic optimization with the objective function (e.g. data samples for machine learning task) being distributed over $n$ machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by $\omega \leq 1$ ($\omega=1$ meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, $\mathcal{O}(1/(nT))$, is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time $\mathcal{O}(1/(\delta^2\omega) \log (1/\epsilon))$ for accuracy $\epsilon > 0$. This is (up to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for $\omega > 0$ and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms do outperform the respective state-of-the-art baselines and CHOCO-SGD can reduce communication by at least two orders of magnitudes.
Efficient Greedy Coordinate Descent for Composite Problems
Karimireddy, Sai Praneeth, Koloskova, Anastasia, Stich, Sebastian U., Jaggi, Martin
Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension $n$, and requiring upto $n$ times fewer iterations. In this paper, we consider greedy updates that are based on subgradients for a class of non-smooth composite problems, which includes $L1$-regularized problems, SVMs and related applications. For these problems we provide (i) the first linear rates of convergence independent of $n$, and show that our greedy update rule provides speedups similar to those obtained in the smooth case. This was previously conjectured to be true for a stronger greedy coordinate selection strategy. Furthermore, we show that (ii) our new selection rule can be mapped to instances of maximum inner product search, allowing to leverage standard nearest neighbor algorithms to speed up the implementation. We demonstrate the validity of the approach through extensive numerical experiments.