Gradient Descent
FedDec: Peer-to-peer Aided Federated Learning
Costantini, Marina, Neglia, Giovanni, Spyropoulos, Thrasyvoulos
Federated learning (FL) is a recent machine learning framework that allows multiple agents, each of them with their own dataset, to train a model collaboratively without sharing their data [1-4]. The federated setting assumes that all agents are connected to a server that can communicate with each of them and that is in charge of aggregating the agents' updates to obtain the global model. This is similar to parallel distributed (PD) model training [5-8], with one crucial difference: in the latter, the agents send gradients to the central server to update the parameter value with a gradient step, while in FL the agents send their own local parameters for the server to average them. This has an impact on the communication frequency required by each framework: in PD one round of communication between (usually all) the agents and the server has to happen every time a (mini-batch) stochastic gradient descent (SGD) step is taken at the nodes, while in FL (i) multiple SGD updates can happen before a new server communication round takes place (which in FL literature are usually called local updates), and (ii) not all devices need to engage in the server communication round (which is known as partial participation). This makes FL a much more suitable option for settings with a large number of agents and a limited communication bandwidth with the server. In contrast to the approaches described above, the decentralized setting does not rely on a central server for the aggregation of the nodes' updates.
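The contrast drawn above between PD and FL communication can be sketched in a few lines. This is a minimal illustration, not the FedDec method: it assumes each agent holds a toy scalar loss 0.5 * (w - c_i)^2 with exact gradients, and the names `local_updates`, `fl_round`, `pd_step`, and `participation` are hypothetical.

```python
import random

def local_updates(w, center, lr, local_steps):
    """Run several gradient steps on the toy local loss 0.5 * (w - center)**2."""
    for _ in range(local_steps):
        w -= lr * (w - center)
    return w

def fl_round(w_global, centers, lr, local_steps, participation, rng):
    """One FL round: a sampled subset trains locally, the server averages parameters."""
    n_sampled = max(1, round(participation * len(centers)))
    sampled = rng.sample(range(len(centers)), n_sampled)
    local_models = [local_updates(w_global, centers[i], lr, local_steps) for i in sampled]
    return sum(local_models) / len(local_models)

def pd_step(w_global, centers, lr):
    """One PD step: every agent sends a gradient, the server averages them and steps."""
    grads = [w_global - c for c in centers]
    return w_global - lr * sum(grads) / len(grads)

# Toy data (assumed): the global optimum is the mean of the centers, 2.5.
centers = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)
w_fl, w_pd = 0.0, 0.0
for _ in range(200):
    # FL: 5 local steps per round, only half of the agents participate.
    w_fl = fl_round(w_fl, centers, lr=0.1, local_steps=5, participation=0.5, rng=rng)
    # PD: one communication round per gradient step, all agents participate.
    w_pd = pd_step(w_pd, centers, lr=0.1)
```

In this toy setting the PD iterate converges to the mean of the centers, while the FL iterate with partial participation keeps fluctuating around it because each round averages a different subset of agents.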
Fast, Distribution-free Predictive Inference for Neural Networks with Coverage Guarantees
Gao, Yue, Raskutti, Garvesh, Willett, Rebecca
To assess the accuracy of parameter estimates or predictions without specific distributional knowledge of the data, re-sampling or sub-sampling of the available data has long been used to construct prediction intervals, and there is a rich history in the statistics literature on the jackknife and bootstrap methods, see Stine (1985), Efron (1979), Quenouille (1949), Efron and Gong (1983). Among these re-sampling methods, leave-one-out methods (generally referred to as "cross-validation" or "jackknife") are widely used to assess or calibrate predictive accuracy, and appear throughout a large body of literature (Stone, 1974, Geisser, 1975). While a large body of past work has demonstrated with extensive evidence that jackknife-type methods have reliable empirical performance, the theoretical properties of these methods were studied relatively little until recently, see Steinberger and Leeb (2018), Bousquet and Elisseeff (2002). One of the most important of these theoretical results is Foygel Barber et al. (2019), which introduces a crucial modification of the traditional jackknife method, the jackknife+, that permits rigorous coverage guarantees of at least $1 - 2\alpha$ regardless of the distribution of the data points, for any algorithm that treats the training points symmetrically. We will revisit this work and give more relevant details in Section 2.1. Although jackknife+ has been proven to have coverage guarantees without distributional assumptions, in practice this method is computationally costly, since we need to train n (the training sample size) leave-one-out models from scratch to find the predictive interval. Especially for large and complicated models like neural networks, this computational cost is prohibitive. The goal of this paper is to provide a fast algorithm with theoretical coverage guarantees similar to those of jackknife+.
To achieve this goal, we develop a new procedure, called Differentially Private Lazy Predictive Inference (DP-Lazy PI), which combines two ideas: lazy training of neural networks and differentially private stochastic gradient descent (DP-SGD).
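To make the cost of the baseline concrete, here is a minimal sketch of the jackknife+ interval of Foygel Barber et al. (2019), not of DP-Lazy PI itself. It assumes a toy learner (a least-squares slope through the origin, `fit_slope`, chosen only because it treats training points symmetrically and refits cheaply); the function names and the synthetic data are hypothetical.

```python
import math

def fit_slope(xs, ys):
    """Least-squares slope through the origin (a symmetric toy learner)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def jackknife_plus_interval(xs, ys, x_new, alpha):
    """Jackknife+ interval with target coverage at least 1 - 2*alpha."""
    n = len(xs)
    lower_vals, upper_vals = [], []
    for i in range(n):
        # Refit from scratch without point i: this is the n-fold training cost.
        beta_i = fit_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        resid_i = abs(ys[i] - beta_i * xs[i])      # leave-one-out residual
        pred_i = beta_i * x_new                    # leave-one-out prediction
        lower_vals.append(pred_i - resid_i)
        upper_vals.append(pred_i + resid_i)
    lower_vals.sort()
    upper_vals.sort()
    k_lo = math.floor(alpha * (n + 1))             # rank of the lower quantile
    k_hi = math.ceil((1 - alpha) * (n + 1))        # rank of the upper quantile
    lo = lower_vals[k_lo - 1] if k_lo >= 1 else -math.inf
    hi = upper_vals[k_hi - 1] if k_hi <= n else math.inf
    return lo, hi

# Synthetic data (assumed): y = 2x plus a small deterministic perturbation.
xs = [0.1 * k for k in range(1, 21)]
ys = [2.0 * x + 0.1 * math.sin(3 * k) for k, x in enumerate(xs, start=1)]
interval = jackknife_plus_interval(xs, ys, x_new=1.0, alpha=0.1)
```

The loop refits the model n times, which is cheap for a one-parameter slope but is exactly the step that becomes prohibitive when each refit is a neural network trained from scratch.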
SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance
We study Stochastic Gradient Descent with AdaGrad stepsizes: a popular adaptive (self-tuning) method for first-order stochastic optimization. Despite being well studied, existing analyses of this method suffer from various shortcomings: they either assume some knowledge of the problem parameters, impose strong global Lipschitz conditions, or fail to give bounds that hold with high probability. We provide a comprehensive analysis of this basic method without any of these limitations, in both the convex and non-convex (smooth) cases, that additionally supports a general "affine variance" noise model and provides sharp rates of convergence in both the low-noise and high-noise regimes.
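As a rough illustration of the method being analyzed, the sketch below runs SGD with a scalar AdaGrad stepsize, eta / sqrt(gamma^2 + sum of squared gradient norms so far). It assumes the scalar ("norm") variant of AdaGrad; the parameter names `eta` and `gamma`, the toy objective, and the noise model in the example are illustrative assumptions, not taken from the paper.

```python
import math
import random

def sgd_adagrad_norm(grad_fn, w0, eta=1.0, gamma=1e-8, steps=1000):
    """SGD whose stepsize is eta / sqrt(gamma**2 + sum_s ||g_s||**2)."""
    w = list(w0)
    sum_sq = gamma ** 2
    for _ in range(steps):
        g = grad_fn(w)
        sum_sq += sum(gi * gi for gi in g)     # accumulate squared gradient norms
        step = eta / math.sqrt(sum_sq)         # no smoothness or noise constants needed
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w

# Example (assumed): f(w) = 0.5 * ||w||**2 with noise whose variance grows
# with the gradient norm, in the spirit of an affine variance model.
rng = random.Random(0)

def noisy_grad(w, sigma0=0.1, sigma1=0.5):
    return [wi * (1.0 + sigma1 * rng.gauss(0, 1)) + sigma0 * rng.gauss(0, 1) for wi in w]

w_noisy = sgd_adagrad_norm(noisy_grad, [5.0, -3.0])
w_exact = sgd_adagrad_norm(lambda w: list(w), [5.0, -3.0])
```

The stepsize is computed only from observed gradients, which is the "self-tuning" property: nothing about the smoothness constant or the noise level is passed in.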
A Scalable Walsh-Hadamard Regularizer to Overcome the Low-degree Spectral Bias of Neural Networks
Gorji, Ali, Amrollahi, Andisheh, Krause, Andreas
Despite the capacity of neural nets to learn arbitrary functions, models trained through gradient descent often exhibit a bias towards "simpler" functions. Various notions of simplicity have been introduced to characterize this behavior. Here, we focus on the case of neural networks with discrete (zero-one), high-dimensional inputs through the lens of their Fourier (Walsh-Hadamard) transforms, where the notion of simplicity can be captured through the degree of the Fourier coefficients. We empirically show that neural networks have a tendency to learn lower-degree frequencies. We show how this spectral bias towards low-degree frequencies can in fact hurt the neural network's generalization on real-world datasets. To remedy this we propose a new scalable functional regularization scheme that helps the neural network learn higher-degree frequencies. Our regularizer also helps avoid erroneous identification of low-degree frequencies, which further improves generalization. We extensively evaluate our regularizer on synthetic datasets to gain insights into its behavior. Finally, we show significantly improved generalization on four different datasets compared to standard neural networks and other relevant baselines.
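The notion of degree used above can be made concrete with a small sketch. It computes the normalized Walsh-Hadamard transform of a function on {0,1}^d given as a table of 2^d values, then groups the squared coefficients by degree (the number of input bits a coefficient depends on). This illustrates the transform only, not the proposed regularizer; the function names are hypothetical.

```python
def walsh_hadamard(values):
    """Normalized fast Walsh-Hadamard transform of a table of length 2**d."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "table length must be a power of two"
    coeffs = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = coeffs[j], coeffs[j + h]
                coeffs[j], coeffs[j + h] = a + b, a - b   # butterfly step
        h *= 2
    return [c / n for c in coeffs]

def energy_by_degree(coeffs):
    """Sum of squared coefficients per degree (popcount of the frequency mask)."""
    d = len(coeffs).bit_length() - 1
    energy = [0.0] * (d + 1)
    for mask, c in enumerate(coeffs):
        energy[bin(mask).count("1")] += c * c
    return energy

# Example: on 3 bits, f(x) = (-1)**(x0 + x1) is a pure degree-2 function.
d = 3
table = [(-1) ** ((x & 1) + ((x >> 1) & 1)) for x in range(2 ** d)]
coeffs = walsh_hadamard(table)
energy = energy_by_degree(coeffs)
```

For this parity-like example all of the spectral energy sits at degree 2, on the single frequency that involves bits 0 and 1.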
Asymptotically efficient one-step stochastic gradient descent
Bensoussan, Alain, Brouste, Alexandre, Esstafa, Youssef
A generic, fast and asymptotically efficient method for parametric estimation is described. It is based on stochastic gradient descent on the log-likelihood function, corrected by a single step of the Fisher scoring algorithm. We show theoretically and by simulations in the i.i.d. setting that it is an interesting alternative to the usual stochastic gradient descent with averaging or adaptive stochastic gradient descent.
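The two-stage idea can be sketched on a toy model. The example below assumes i.i.d. exponential data with unknown rate theta, for which the score is 1/theta - x and the Fisher information is 1/theta^2. The stepsize schedule, the projection interval, and the use of the full sample in both stages are illustrative assumptions and may differ from the paper's construction.

```python
import random

def sgd_then_one_step(xs, theta0=1.0, lo=0.1, hi=10.0):
    """Estimate an exponential rate: one SGD pass, then one Fisher scoring step."""
    theta = theta0
    for k, x in enumerate(xs, start=1):
        score = 1.0 / theta - x                    # d/dtheta of log(theta) - theta * x
        theta += 0.5 * k ** -0.75 * score          # ascent step on the log-likelihood
        theta = min(max(theta, lo), hi)            # projection keeps the rate positive
    mean_score = 1.0 / theta - sum(xs) / len(xs)   # average score at the SGD estimate
    fisher_info = 1.0 / theta ** 2                 # Fisher information of Exp(theta)
    return theta, theta + mean_score / fisher_info

rng = random.Random(0)
sample = [rng.expovariate(2.0) for _ in range(5000)]   # true rate 2.0 (assumed)
theta_sgd, theta_one_step = sgd_then_one_step(sample)
theta_mle = len(sample) / sum(sample)                  # closed-form MLE for comparison
```

In this model, if the SGD estimate has relative error e with respect to the MLE, the corrected estimate has relative error e^2, which is why a single scoring step applied to a reasonable initial guess lands much closer to the MLE.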
Tighter Lower Bounds for Shuffling SGD: Random Permutations and Beyond
Cha, Jaeyoung, Lee, Jaewook, Yun, Chulhee
We study convergence lower bounds of without-replacement stochastic gradient descent (SGD) for solving smooth (strongly-)convex finite-sum minimization problems. Unlike most existing results focusing on final iterate lower bounds in terms of the number of components $n$ and the number of epochs $K$, we seek bounds for arbitrary weighted average iterates that are tight in all factors including the condition number $\kappa$. For SGD with Random Reshuffling, we present lower bounds that have tighter $\kappa$ dependencies than existing bounds. Our results are the first to perfectly close the gap between lower and upper bounds for weighted average iterates in both strongly-convex and convex cases. We also prove weighted average iterate lower bounds for arbitrary permutation-based SGD, which apply to all variants that carefully choose the best permutation. Our bounds improve the existing bounds in factors of $n$ and $\kappa$ and thereby match the upper bounds shown for a recently proposed algorithm called GraB.
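For readers unfamiliar with the setting, the sketch below shows without-replacement SGD on a finite sum: each epoch visits every component exactly once, either in a freshly shuffled order (Random Reshuffling) or in a fixed order. The toy components 0.5 * (w - c_i)^2 and all names are illustrative assumptions; the code does not reproduce the paper's lower-bound constructions or GraB.

```python
import random

def shuffling_sgd(component_grads, w0, lr, epochs, rng, reshuffle=True):
    """Permutation-based SGD: one pass over all n components per epoch."""
    order = list(range(len(component_grads)))
    w = w0
    for _ in range(epochs):
        if reshuffle:
            rng.shuffle(order)          # Random Reshuffling: new permutation each epoch
        for i in order:                 # without replacement: each index used once
            w -= lr * component_grads[i](w)
    return w

# Toy finite sum (assumed): f_i(w) = 0.5 * (w - c_i)**2, minimized at mean(c) = 2.5.
centers = [1.0, 2.0, 3.0, 4.0]
grads = [lambda w, c=c: w - c for c in centers]
w_rr = shuffling_sgd(grads, w0=0.0, lr=0.01, epochs=500, rng=random.Random(0))
w_fixed = shuffling_sgd(grads, w0=0.0, lr=0.01, epochs=500, rng=random.Random(0), reshuffle=False)
```

With a constant stepsize the end-of-epoch iterate settles near the minimizer with a small order-dependent offset, which is the kind of permutation-dependent error that bounds in $n$, $K$, and $\kappa$ quantify.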
On the effectiveness of partial variance reduction in federated learning with heterogeneous data
Li, Bo, Schmidt, Mikkel N., Alstrøm, Tommy S., Stich, Sebastian U.
Data heterogeneity across clients is a key challenge in federated learning. Prior works address this by either aligning client and server models or using control variates to correct client model drift. Although these methods achieve fast convergence in convex or simple non-convex problems, the performance in over-parameterized models such as deep neural networks is lacking. In this paper, we first revisit the widely used FedAvg algorithm in a deep neural network to understand how data heterogeneity influences the gradient updates across the neural network layers. We observe that while the feature extraction layers are learned efficiently by FedAvg, the substantial diversity of the final classification layers across clients impedes the performance. Motivated by this, we propose to correct model drift by variance reduction only on the final layers. We demonstrate that this significantly outperforms existing benchmarks at a similar or lower communication cost. We furthermore provide proof for the convergence rate of our algorithm.
Correlated Noise in Epoch-Based Stochastic Gradient Descent: Implications for Weight Variances
Stochastic gradient descent (SGD) has become a cornerstone of neural network optimization, yet the noise introduced by SGD is often assumed to be uncorrelated over time, despite the ubiquity of epoch-based training. In this work, we challenge this assumption and investigate the effects of epoch-based noise correlations on the stationary distribution of discrete-time SGD with momentum, limited to a quadratic loss. Our main contributions are twofold: first, we calculate the exact autocorrelation of the noise for training in epochs under the assumption that the noise is independent of small fluctuations in the weight vector; second, we explore the influence of correlations introduced by the epoch-based learning scheme on SGD dynamics. We find that for directions with a curvature greater than a hyperparameter-dependent crossover value, the results for uncorrelated noise are recovered. However, for relatively flat directions, the weight variance is significantly reduced. We provide an intuitive explanation for these results based on a crossover between correlation times, contributing to a deeper understanding of the dynamics of SGD in the presence of epoch-based noise correlations.
Communication-Efficient Gradient Descent-Accent Methods for Distributed Variational Inequalities: Unified Analysis and Local Updates
Zhang, Siqi, Choudhury, Sayantan, Stich, Sebastian U, Loizou, Nicolas
Distributed and federated learning algorithms and techniques are associated primarily with minimization problems. However, with the increase of minimax optimization and variational inequality problems in machine learning, the necessity of designing efficient distributed/federated learning approaches for these problems is becoming more apparent. In this paper, we provide a unified convergence analysis of communication-efficient local training methods for distributed variational inequality problems (VIPs). Our approach is based on a general key assumption on the stochastic estimates that allows us to propose and analyze several novel local training algorithms under a single framework for solving a class of structured non-monotone VIPs. We present the first local gradient descent-ascent algorithms with provably improved communication complexity for solving distributed variational inequalities on heterogeneous data. The general algorithmic framework recovers state-of-the-art algorithms and their sharp convergence guarantees when the setting is specialized to minimization or minimax optimization problems. Finally, we demonstrate the strong performance of the proposed algorithms compared to state-of-the-art methods when solving federated minimax optimization problems.
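A bare-bones version of local gradient descent-ascent with periodic averaging is sketched below. It is only meant to show the communication pattern and is not one of the paper's algorithms. It assumes each agent i holds the toy saddle function f_i(x, y) = 0.5 (x - a_i)^2 + x y - 0.5 (y - b_i)^2, minimized over x and maximized over y, with exact gradients; all names are hypothetical.

```python
def local_gda(a, b, lr, local_steps, rounds):
    """Local gradient descent-ascent with server averaging after each round."""
    x, y = 0.0, 0.0
    for _ in range(rounds):
        local_points = []
        for a_i, b_i in zip(a, b):
            xi, yi = x, y                       # each agent starts from the server point
            for _ in range(local_steps):
                grad_x = xi - a_i + yi          # partial derivative of f_i in x
                grad_y = xi - yi + b_i          # partial derivative of f_i in y
                xi, yi = xi - lr * grad_x, yi + lr * grad_y   # descent in x, ascent in y
            local_points.append((xi, yi))
        # Communication round: the server averages the agents' local points.
        x = sum(p[0] for p in local_points) / len(local_points)
        y = sum(p[1] for p in local_points) / len(local_points)
    return x, y

# Two heterogeneous agents (assumed data); the saddle point of the average is (0.5, 1.5).
a, b = [1.0, 3.0], [0.0, 2.0]
x1, y1 = local_gda(a, b, lr=0.1, local_steps=1, rounds=500)
x5, y5 = local_gda(a, b, lr=0.1, local_steps=5, rounds=100)
```

Because every agent in this toy problem has the same curvature, averaging after several local steps still reaches the global saddle point; with agent-specific curvature the local iterates would generally drift, which is the regime where the choice of local method matters.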
Target-based Surrogates for Stochastic Optimization
Lavington, Jonathan Wilder, Vaswani, Sharan, Babanezhad, Reza, Schmidt, Mark, Roux, Nicolas Le
We consider minimizing functions for which it is expensive to compute the (possibly stochastic) gradient. Such functions are prevalent in reinforcement learning, imitation learning and adversarial training. Our target optimization framework uses the (expensive) gradient computation to construct surrogate functions in a target space (e.g., the logits output by a linear model for classification) that can be minimized efficiently. This allows for multiple parameter updates to the model, amortizing the cost of gradient computation. In the full-batch setting, we prove that our surrogate is a global upper-bound on the loss, and can be (locally) minimized using a black-box optimization algorithm. We prove that the resulting majorization-minimization algorithm ensures convergence to a stationary point of the loss. Next, we instantiate our framework in the stochastic setting and propose the $SSO$ algorithm, which can be viewed as projected stochastic gradient descent in the target space. This connection enables us to prove theoretical guarantees for $SSO$ when minimizing convex functions. Our framework allows the use of standard stochastic optimization algorithms to construct surrogates which can be minimized by any deterministic optimization method. To evaluate our framework, we consider a suite of supervised learning and imitation learning problems. Our experiments indicate the benefits of target optimization and the effectiveness of $SSO$.