Gradient Descent
Bridging Differential Privacy and Byzantine-Robustness via Model Aggregation
This paper aims at jointly addressing two seemly conflicting issues in federated learning: differential privacy (DP) and Byzantine-robustness, which are particularly challenging when the distributed data are non-i.i.d. (independent and identically distributed). The standard DP mechanisms add noise to the transmitted messages, and entangles with robust stochastic gradient aggregation to defend against Byzantine attacks. In this paper, we decouple the two issues via robust stochastic model aggregation, in the sense that our proposed DP mechanisms and the defense against Byzantine attacks have separated influence on the learning performance. Leveraging robust stochastic model aggregation, at each iteration, each worker calculates the difference between the local model and the global one, followed by sending the element-wise signs to the master node, which enables robustness to Byzantine attacks. Further, we design two DP mechanisms to perturb the uploaded signs for the purpose of privacy preservation, and prove that they are $(\epsilon,0)$-DP by exploiting the properties of noise distributions. With the tools of Moreau envelop and proximal point projection, we establish the convergence of the proposed algorithm when the cost function is nonconvex. We analyze the trade-off between privacy preservation and learning performance, and show that the influence of our proposed DP mechanisms is decoupled with that of robust stochastic model aggregation. Numerical experiments demonstrate the effectiveness of the proposed algorithm.
Gradient-descent quantum process tomography by learning Kraus operators
Ahmed, Shahnawaz, Quijandrรญa, Fernando, Kockum, Anton Frisk
We perform quantum process tomography (QPT) for both discrete- and continuous-variable quantum systems by learning a process representation using Kraus operators. The Kraus form ensures that the reconstructed process is completely positive. To make the process trace-preserving, we use a constrained gradient-descent (GD) approach on the so-called Stiefel manifold during optimization to obtain the Kraus operators. Our ansatz uses a few Kraus operators to avoid direct estimation of large process matrices, e.g., the Choi matrix, for low-rank quantum processes. The GD-QPT matches the performance of both compressed-sensing (CS) and projected least-squares (PLS) QPT in benchmarks with two-qubit random processes, but shines by combining the best features of these two methods. Similar to CS (but unlike PLS), GD-QPT can reconstruct a process from just a small number of random measurements, and similar to PLS (but unlike CS) it also works for larger system sizes, up to at least five qubits. We envisage that the data-driven approach of GD-QPT can become a practical tool that greatly reduces the cost and computational effort for QPT in intermediate-scale quantum systems.
Formal guarantees for heuristic optimization algorithms used in machine learning
Recently, Stochastic Gradient Descent (SGD) and its variants have become the dominant methods in the large-scale optimization of machine learning (ML) problems. A variety of strategies have been proposed for tuning the step sizes, ranging from adaptive step sizes to heuristic methods to change the step size in each iteration. Also, momentum has been widely employed in ML tasks to accelerate the training process. Yet, there is a gap in our theoretical understanding of them. In this work, we start to close this gap by providing formal guarantees to a few heuristic optimization methods and proposing improved algorithms. First, we analyze a generalized version of the AdaGrad (Delayed AdaGrad) step sizes in both convex and non-convex settings, showing that these step sizes allow the algorithms to automatically adapt to the level of noise of the stochastic gradients. We show for the first time sufficient conditions for Delayed AdaGrad to achieve almost sure convergence of the gradients to zero. Moreover, we present a high probability analysis for Delayed AdaGrad and its momentum variant in the non-convex setting. Second, we analyze SGD with exponential and cosine step sizes, which are empirically successful but lack theoretical support. We provide the very first convergence guarantees for them in the smooth and non-convex setting, with and without the Polyak-{\L}ojasiewicz (PL) condition. We also show their good property of adaptivity to noise under the PL condition. Third, we study the last iterate of momentum methods. We prove the first lower bound in the convex setting for the last iterate of SGD with constant momentum. Moreover, we investigate a class of Follow-The-Regularized-Leader-based momentum algorithms with increasing momentum and shrinking updates. We show that their last iterate has optimal convergence for unconstrained convex stochastic optimization problems.
Breaking it Down: Gradient Descent
Originally published on Towards AI the World's Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses. Gradient descent is an optimization algorithm that is used to improve the performance of deep/machine learning models.
Differentially Private SGDA for Minimax Problems
Yang, Zhenhuan, Hu, Shu, Lei, Yunwen, Varshney, Kush R., Lyu, Siwei, Ying, Yiming
Stochastic gradient descent ascent (SGDA) and its variants have been the workhorse for solving minimax problems. However, in contrast to the well-studied stochastic gradient descent (SGD) with differential privacy (DP) constraints, there is little work on understanding the generalization (utility) of SGDA with DP constraints. In this paper, we use the algorithmic stability approach to establish the generalization (utility) of DP-SGDA in different settings. In particular, for the convex-concave setting, we prove that the DP-SGDA can achieve an optimal utility rate in terms of the weak primal-dual population risk in both smooth and non-smooth cases. To our best knowledge, this is the first-ever-known result for DP-SGDA in the non-smooth case. We further provide its utility analysis in the nonconvex-strongly-concave setting which is the first-ever-known result in terms of the primal population risk. The convergence and generalization results for this nonconvex setting are new even in the non-private setting. Finally, numerical experiments are conducted to demonstrate the effectiveness of DP-SGDA for both convex and nonconvex cases.
One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares
Min, Youngjae, Ahn, Kwangjun, Azizan, Navid
While deep neural networks are capable of achieving state-of-the-art performance in various domains, their training typically requires iterating for many passes over the dataset. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the increasing use of overparameterized models, we develop Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit every new datapoint while changing the parameters in a direction that causes the least change to the predictions on previous datapoints. By doing so, we bridge two seemingly distinct algorithms in adaptive filtering and machine learning, namely the recursive least-squares (RLS) algorithm and orthogonal gradient descent (OGD). Our algorithm uses the memory efficiently by exploiting the structure of the streaming data via an incremental principal component analysis (IPCA). Further, we show that, for overparameterized linear models, the parameter vector obtained by our algorithm is what stochastic gradient descent (SGD) would converge to in the standard multi-pass setting. Finally, we generalize the results to the nonlinear setting for highly overparameterized models, relevant for deep learning. Our experiments show the effectiveness of the proposed method compared to the baselines.
BPFISH: Blockchain and Privacy-preserving FL Inspired Smart Healthcare
Singh, Moirangthem Biken, Pratap, Ajay
Abstract--This paper proposes Federated Learning (FL) based smart healthcare system where Medical Centers (MCs) train the local model using the data collected from patients and send the model weights to the miners in a blockchain-based robust framework without sharing raw data, keeping privacy preservation into deliberation. We formulate an optimization problem by maximizing the utility and minimizing the loss function considering energy consumption and FL process delay of MCs for learning effective models on distributed healthcare data underlying a blockchain-based framework. We propose a solution in two stages-first, offer a stable matching-based association algorithm to maximize the utility of both miners and MCs and then solve loss minimization using Stochastic Gradient Descent (SGD) algorithm employing FL under Differential Privacy (DP) and blockchain technology. Moreover, we incorporate blockchain technology to provide tempered resistant and decentralized model weight sharing in the proposed FL-based framework. The effectiveness of the proposed model is shown through simulation on real-world healthcare data comparing other state-of-the-art techniques.
Improving Simulated Annealing for Clique Partitioning Problems
Gao, Jian, Lv, Yiqi, Liu, Minghao, Cai, Shaowei, Ma, Feifei
The Clique Partitioning Problem (CPP) is essential in graph theory with a number of important applications. Due to its NP-hardness, efficient algorithms for solving this problem are very crucial for practical purposes, and simulated annealing is proved to be effective in state-of-the-art CPP algorithms. However, to make simulated annealing more efficient to solve large-scale CPPs, in this paper, we propose a new iterated simulated annealing algorithm. Several methods are proposed in our algorithm to improve simulated annealing. First, a new configuration checking strategy based on timestamp is presented and incorporated into simulated annealing to avoid search cycles. Afterwards, to enhance the local search ability of simulated annealing and speed up convergence, we combine our simulated annealing with a descent search method to solve the CPP. This method further improves solutions found by simulated annealing, and thus compensates for the local search effect. To further accelerate the convergence speed, we introduce a shrinking factor to decline initial temperature and then propose an iterated local search algorithm based on simulated annealing. Additionally, a restart strategy is adopted when the search procedure converges. Extensive experiments on benchmark instances of the CPP were carried out, and the results suggest that the proposed simulated annealing algorithm outperforms all the existing heuristic algorithms, including five state-of-the-art algorithms. Thus the best-known solutions for 34 instances out of 94 are updated. We also conduct comparative analyses of the proposed strategies and show their effectiveness.
Gradient Descent and its Types - Analytics Vidhya
This article was published as a part of the Data Science Blogathon. In this article, we will explore different types of gradient descent. So let's get started with the article. The algorithm designer can set the learning rate. If we use a learning rate that is too small, it will cause us to update very slowly, requiring more iterations to get a better solution.
On the benefits of non-linear weight updates
Recent work has suggested that the generalisation performance of a DNN is related to the extent to which the Signal-to-Noise Ratio is optimised at each of the nodes. In contrast, Gradient Descent methods do not always lead to SNR-optimal weight configurations. One way to improve SNR performance is to suppress large weight updates and amplify small weight updates. Such balancing is already implicit in some common optimizers, but we propose an approach that makes this explicit. The method applies a non-linear function to gradients prior to making DNN parameter updates. We investigate the performance with such non-linear approaches. The result is an adaptation to existing optimizers that improves performance for many problem types.