Gradient Descent
A Secure Federated Learning Framework for Residential Short Term Load Forecasting
Husnoo, Muhammad Akbar, Anwar, Adnan, Hosseinzadeh, Nasser, Islam, Shama Naz, Mahmood, Abdun Naser, Doss, Robin
Smart meter measurements, though critical for accurate demand forecasting, face several drawbacks including consumers' privacy, data breach issues, to name a few. Recent literature has explored Federated Learning (FL) as a promising privacy-preserving machine learning alternative which enables collaborative learning of a model without exposing private raw data for short term load forecasting. Despite its virtue, standard FL is still vulnerable to an intractable cyber threat known as Byzantine attack carried out by faulty and/or malicious clients. Therefore, to improve the robustness of federated short-term load forecasting against Byzantine threats, we develop a state-of-the-art differentially private secured FL-based framework that ensures the privacy of the individual smart meter's data while protect the security of FL models and architecture. Our proposed framework leverages the idea of gradient quantization through the Sign Stochastic Gradient Descent (SignSGD) algorithm, where the clients only transmit the `sign' of the gradient to the control centre after local model training. As we highlight through our experiments involving benchmark neural networks with a set of Byzantine attack models, our proposed approach mitigates such threats quite effectively and thus outperforms conventional Fed-SGD models.
Statistical Inference with Stochastic Gradient Methods under $\phi$-mixing Data
Liu, Ruiqi, Chen, Xi, Shang, Zuofeng
Stochastic gradient descent (SGD) is a scalable and memory-efficient optimization algorithm for large datasets and stream data, which has drawn a great deal of attention and popularity. The applications of SGD-based estimators to statistical inference such as interval estimation have also achieved great success. However, most of the related works are based on i.i.d. observations or Markov chains. When the observations come from a mixing time series, how to conduct valid statistical inference remains unexplored. As a matter of fact, the general correlation among observations imposes a challenge on interval estimation. Most existing methods may ignore this correlation and lead to invalid confidence intervals. In this paper, we propose a mini-batch SGD estimator for statistical inference when the data is $\phi$-mixing. The confidence intervals are constructed using an associated mini-batch bootstrap SGD procedure. Using ``independent block'' trick from \cite{yu1994rates}, we show that the proposed estimator is asymptotically normal, and its limiting distribution can be effectively approximated by the bootstrap procedure. The proposed method is memory-efficient and easy to implement in practice. Simulation studies on synthetic data and an application to a real-world dataset confirm our theory.
Generalization and Stability of Interpolating Neural Networks with Minimal Width
Taheri, Hossein, Thrampoulidis, Christos
We investigate the generalization and optimization properties of shallow neural-network classifiers trained by gradient descent in the interpolating regime. Specifically, in a realizable scenario where model weights can achieve arbitrarily small training error $\epsilon$ and their distance from initialization is $g(\epsilon)$, we demonstrate that gradient descent with $n$ training data achieves training error $O(g(1/T)^2 /T)$ and generalization error $O(g(1/T)^2 /n)$ at iteration $T$, provided there are at least $m=\Omega(g(1/T)^4)$ hidden neurons. We then show that our realizable setting encompasses a special case where data are separable by the model's neural tangent kernel. For this and logistic-loss minimization, we prove the training loss decays at a rate of $\tilde O(1/ T)$ given polylogarithmic number of neurons $m=\Omega(\log^4 (T))$. Moreover, with $m=\Omega(\log^{4} (n))$ neurons and $T\approx n$ iterations, we bound the test loss by $\tilde{O}(1/n)$. Our results differ from existing generalization outcomes using the algorithmic-stability framework, which necessitate polynomial width and yield suboptimal generalization rates. Central to our analysis is the use of a new self-bounded weak-convexity property, which leads to a generalized local quasi-convexity property for sufficiently parameterized neural-network classifiers. Eventually, despite the objective's non-convexity, this leads to convergence and generalization-gap bounds that resemble those found in the convex setting of linear logistic regression.
On Generalization of Decentralized Learning with Separable Data
Taheri, Hossein, Thrampoulidis, Christos
Decentralized learning offers privacy and communication efficiency when data are naturally distributed among agents communicating over an underlying graph. Motivated by overparameterized learning settings, in which models are trained to zero training loss, we study algorithmic and generalization properties of decentralized learning with gradient descent on separable data. Specifically, for decentralized gradient descent (DGD) and a variety of loss functions that asymptote to zero at infinity (including exponential and logistic losses), we derive novel finite-time generalization bounds. This complements a long line of recent work that studies the generalization performance and the implicit bias of gradient descent over separable data, but has thus far been limited to centralized learning scenarios. Notably, our generalization bounds approximately match in order their centralized counterparts. Critical behind this, and of independent interest, is establishing novel bounds on the training loss and the rate-of-consensus of DGD for a class of self-bounded losses. Finally, on the algorithmic front, we design improved gradient-based routines for decentralized learning with separable data and empirically demonstrate orders-of-magnitude of speed-up in terms of both training and generalization performance.
Risk-aware linear bandits with convex loss
Saux, Patrick, Maillard, Odalric-Ambrym
In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-square problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only approximated solution obtained by online gradient descent, at the cost of slightly higher regret. We conclude by evaluating the resulting algorithms on numerical experiments.
FedGiA: An Efficient Hybrid Algorithm for Federated Learning
Zhou, Shenglong, Li, Geoffrey Ye
Federated learning has shown its advances recently but is still facing many challenges, such as how algorithms save communication resources and reduce computational costs, and whether they converge. To address these critical issues, we propose a hybrid federated learning algorithm (FedGiA) that combines the gradient descent and the inexact alternating direction method of multipliers. The proposed algorithm is more communication- and computation-efficient than several state-of-the-art algorithms theoretically and numerically. Moreover, it also converges globally under mild conditions.
Causality-based Counterfactual Explanation for Classification Models
Duong, Tri Dung, Li, Qian, Xu, Guandong
Counterfactual explanation is one branch of interpretable machine learning that produces a perturbation sample to change the model's original decision. The generated samples can act as a recommendation for end-users to achieve their desired outputs. Most of the current counterfactual explanation approaches are the gradient-based method, which can only optimize the differentiable loss functions with continuous variables. Accordingly, the gradient-free methods are proposed to handle the categorical variables, which however have several major limitations: 1) causal relationships among features are typically ignored when generating the counterfactuals, possibly resulting in impractical guidelines for decision-makers; 2) the counterfactual explanation algorithm requires a great deal of effort into parameter tuning for dertermining the optimal weight for each loss functions which must be conducted repeatedly for different datasets and settings. In this work, to address the above limitations, we propose a prototype-based counterfactual explanation framework (ProCE). ProCE is capable of preserving the causal relationship underlying the features of the counterfactual data. In addition, we design a novel gradient-free optimization based on the multi-objective genetic algorithm that generates the counterfactual explanations for the mixed-type of continuous and categorical features. Numerical experiments demonstrate that our method compares favorably with state-of-the-art methods and therefore is applicable to existing prediction models. All the source codes and data are available at \url{https://github.com/tridungduong16/multiobj-scm-cf}.
ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!
Mishchenko, Konstantin, Malinovsky, Grigory, Stich, Sebastian, Richtรกrik, Peter
We introduce ProxSkip -- a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth proximable ($\psi$) function. The canonical approach to solving such problems is via the proximal gradient descent (ProxGD) algorithm, which is based on the evaluation of the gradient of $f$ and the prox operator of $\psi$ in each iteration. In this work we are specifically interested in the regime in which the evaluation of prox is costly relative to the evaluation of the gradient, which is the case in many applications. ProxSkip allows for the expensive prox operator to be skipped in most iterations: while its iteration complexity is $\mathcal{O}\left(\kappa \log \frac{1}{\varepsilon}\right)$, where $\kappa$ is the condition number of $f$, the number of prox evaluations is $\mathcal{O}\left(\sqrt{\kappa} \log \frac{1}{\varepsilon}\right)$ only. Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently on all devices, and evaluation of prox corresponds to (expensive) communication in the form of gradient averaging. In this context, ProxSkip offers an effective acceleration of communication complexity. Unlike other local gradient-type methods, such as FedAvg, SCAFFOLD, S-Local-GD and FedLin, whose theoretical communication complexity is worse than, or at best matching, that of vanilla GD in the heterogeneous data regime, we obtain a provable and large improvement without any heterogeneity-bounding assumptions.
Convergence of stochastic gradient descent on parameterized sphere with applications to variational Monte Carlo simulation
Abrahamsen, Nilin, Ding, Zhiyan, Goldshlager, Gil, Lin, Lin
We analyze stochastic gradient descent (SGD) type algorithms on a high-dimensional sphere which is parameterized by a neural network up to a normalization constant. We provide a new algorithm for the setting of supervised learning and show its convergence both theoretically and numerically. We also provide the first proof of convergence for the unsupervised setting, which corresponds to the widely used variational Monte Carlo (VMC) method in quantum physics.
Interpreting learning in biological neural networks as zero-order optimization method
Recently, significant progress has been made regarding the statistical understanding of artificial neural networks (ANNs). ANNs are motivated by the functioning of the brain, but differ in several crucial aspects. In particular, the locality in the updating rule of the connection parameters in biological neural networks (BNNs) makes it biologically implausible that the learning of the brain is based on gradient descent. In this work, we look at the brain as a statistical method for supervised learning. The main contribution is to relate the local updating rule of the connection parameters in BNNs to a zero-order optimization method. It is shown that the expected values of the iterates implement a modification of gradient descent.