Nguyen, Phuong Ha
Generalizing DP-SGD with Shuffling and Batch Clipping
van Dijk, Marten, Nguyen, Phuong Ha, Nguyen, Toan N., Nguyen, Lam M.
Classical differentially private DP-SGD implements individual clipping with random subsampling, which forces a mini-batch SGD approach. We provide a general differentially private algorithmic framework that goes beyond DP-SGD and allows any first-order optimizer (e.g., classical SGD and momentum-based SGD approaches) in combination with batch clipping, which clips an aggregate of computed gradients rather than summing clipped gradients (as is done in individual clipping). The framework also admits sampling techniques beyond random subsampling, such as shuffling. Our DP analysis follows the $f$-DP approach and introduces a new proof technique which allows us to derive simple closed-form expressions and to also analyse group privacy. In particular, for $E$ epochs of work and groups of size $g$, we show a $\sqrt{gE}$ DP dependency for batch clipping with shuffling.
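To make the distinction between individual clipping and batch clipping concrete, the NumPy sketch below contrasts the two update rules on a toy squared-loss model; the model, noise calibration, and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def clip(v, C):
    """Scale v so that its L2 norm is at most C."""
    norm = np.linalg.norm(v)
    return v if norm <= C else v * (C / norm)

def grad(w, x, y):
    """Gradient of the squared loss 0.5*(w.x - y)^2 for one example (toy model)."""
    return (w @ x - y) * x

def dp_step_individual_clipping(w, X, Y, C, sigma, eta):
    """DP-SGD style step: clip every per-example gradient, sum, then add noise."""
    g = sum(clip(grad(w, x, y), C) for x, y in zip(X, Y))
    g = g + sigma * C * np.random.randn(*w.shape)
    return w - eta * g / len(X)

def dp_step_batch_clipping(w, X, Y, C, sigma, eta):
    """Batch clipping: average the batch gradient first, then clip the aggregate."""
    g = np.mean([grad(w, x, y) for x, y in zip(X, Y)], axis=0)
    g = clip(g, C) + sigma * C * np.random.randn(*w.shape)
    return w - eta * g

# Shuffling: one epoch visits each example exactly once, in a random order,
# split into disjoint batches (no random subsampling).
np.random.seed(0)
X, Y = np.random.randn(32, 5), np.random.randn(32)
w = np.zeros(5)
for batch in np.array_split(np.random.permutation(len(X)), 8):
    w = dp_step_batch_clipping(w, X[batch], Y[batch], C=1.0, sigma=1.0, eta=0.1)
```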
Batch Clipping and Adaptive Layerwise Clipping for Differential Private Stochastic Gradient Descent
Nguyen, Toan N., Nguyen, Phuong Ha, Nguyen, Lam M., van Dijk, Marten
Each round in Differential Private Stochastic Gradient Descent (DPSGD) transmits a sum of clipped gradients obfuscated with Gaussian noise to a central server, which uses it to update a global model that often represents a deep neural network. Since the clipped gradients are computed separately, which we call Individual Clipping (IC), deep neural networks like ResNet-18 cannot use Batch Normalization Layers (BNL), which are a crucial component in deep neural networks for achieving high accuracy. To utilize BNL, we introduce Batch Clipping (BC) where, instead of clipping single gradients as in the original DPSGD, we average and clip batches of gradients. Moreover, the model entries of different layers have different sensitivities to the added Gaussian noise. Therefore, Adaptive Layerwise Clipping methods (ALC), where each layer has its own adaptively fine-tuned clipping constant, have been introduced and studied, but so far without rigorous DP proofs. In this paper, we propose {\em a new ALC and provide rigorous DP proofs for both BC and ALC}. Experiments show that our modified DPSGD with BC and ALC for CIFAR-10 with ResNet-18 converges while DPSGD with IC and ALC does not.
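As a rough illustration of how batch clipping and per-layer clipping constants could interact, consider the sketch below; the adaptation rule (an exponential moving average of observed layer gradient norms) is a placeholder assumption and not the ALC rule proposed in the paper.

```python
import numpy as np

def clip(v, C):
    n = np.linalg.norm(v)
    return v if n <= C else v * (C / n)

def alc_batch_step(layers, layer_grads, C_layers, sigma, eta, beta=0.9):
    """Batch Clipping + Adaptive Layerwise Clipping (sketch).

    layers      : list of per-layer parameter arrays
    layer_grads : averaged batch gradient for each layer (batch clipping, so
                  batch statistics such as BatchNorm can be used)
    C_layers    : per-layer clipping constants, adapted in place
    The adaptation rule below is an illustrative placeholder, not the paper's ALC rule.
    """
    new_layers = []
    for i, (w, g) in enumerate(zip(layers, layer_grads)):
        # move this layer's clipping constant toward its typical gradient norm
        C_layers[i] = beta * C_layers[i] + (1 - beta) * np.linalg.norm(g)
        g = clip(g, C_layers[i]) + sigma * C_layers[i] * np.random.randn(*w.shape)
        new_layers.append(w - eta * g)
    return new_layers

# toy two-"layer" model: each layer is just a weight vector here
np.random.seed(0)
layers = [np.zeros(4), np.zeros(3)]
C_layers = [1.0, 1.0]
for _ in range(5):
    grads = [np.random.randn(4), np.random.randn(3)]   # stand-in batch gradients
    layers = alc_batch_step(layers, grads, C_layers, sigma=1.0, eta=0.1)
```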
Considerations on the Theory of Training Models with Differential Privacy
van Dijk, Marten, Nguyen, Phuong Ha
Privacy leakage is a big problem in the big-data era. Solving a learning task based on big data intrinsically means that sufficient data for training a global model with sufficient clean accuracy (utility) is only available through a collaborative effort. Federated learning is a framework where a learning task is solved by a loose federation of participating devices/clients which are coordinated by a central server [42, 8, 3, 33, 40, 9, 57, 29, 59, 10, 34, 36, 37, 39, 12, 30]. Clients, who use their own local data to participate in a learning task by training a global model, want to have privacy guarantees for their local proprietary data. For this reason, DP-SGD [1] was introduced: it adapts distributed Stochastic Gradient Descent (SGD) [55] with Differential Privacy (DP) [19, 15, 21, 18].
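For context, the DP-SGD recipe of [1] mentioned here clips each example's gradient, sums the clipped gradients, and perturbs the sum with Gaussian noise before the model update; a minimal NumPy sketch with a toy quadratic loss (all hyperparameters illustrative) follows.

```python
import numpy as np

def dp_sgd_step(w, batch_X, batch_Y, C, sigma, eta):
    """One DP-SGD step: per-example clipping, Gaussian noise, then an SGD update."""
    summed = np.zeros_like(w)
    for x, y in zip(batch_X, batch_Y):
        g = (w @ x - y) * x                      # gradient of 0.5*(w.x - y)^2
        norm = np.linalg.norm(g)
        summed += g / max(1.0, norm / C)         # clip each gradient to norm at most C
    noisy = summed + sigma * C * np.random.randn(*w.shape)
    return w - eta * noisy / len(batch_X)

np.random.seed(1)
X, Y = np.random.randn(64, 3), np.random.randn(64)
w = np.zeros(3)
for _ in range(10):
    idx = np.random.choice(len(X), size=16, replace=False)   # random subsampling
    w = dp_sgd_step(w, X[idx], Y[idx], C=1.0, sigma=1.0, eta=0.05)
```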
Differential Private Hogwild! over Distributed Local Data Sets
van Dijk, Marten, Nguyen, Nhuong V., Nguyen, Toan N., Nguyen, Lam M., Nguyen, Phuong Ha
The objective is to minimize a loss function with respect to model parameters $w$. This problem is known as empirical risk minimization, and it covers a wide range of convex and non-convex problems from the ML domain, including, but not limited to, logistic regression, multi-kernel learning, conditional random fields, and neural networks. We want to solve this problem, equation (1) in the paper, in a distributed setting where many clients have their own local data sets and the finite-sum minimization is over the collection of all local data sets.
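Equation (1) itself is not reproduced in this excerpt; a reconstruction of the standard finite-sum empirical risk objective over the union of local data sets, consistent with the abstract, is

$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{N}\sum_{c=1}^{m} \sum_{\xi \in \mathcal{D}_c} f(w;\xi), \qquad N = \sum_{c=1}^{m} |\mathcal{D}_c|,$$

where $\mathcal{D}_1, \dots, \mathcal{D}_m$ denote the local data sets of the $m$ clients and $f(w;\xi)$ is the loss of model $w$ on sample $\xi$.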
Hogwild! over Distributed Local Data Sets with Linearly Increasing Mini-Batch Sizes
van Dijk, Marten, Nguyen, Nhuong V., Nguyen, Toan N., Nguyen, Lam M., Tran-Dinh, Quoc, Nguyen, Phuong Ha
Hogwild! implements asynchronous Stochastic Gradient Descent (SGD) where multiple threads in parallel access a common repository containing training data, perform SGD iterations and update shared state that represents a jointly learned (global) model. We consider big data analysis where training data is distributed among local data sets -- and we wish to move SGD computations to local compute nodes where local data resides. The results of these local SGD computations are aggregated by a central "aggregator" which mimics Hogwild!. We show how local compute nodes can start with small mini-batch sizes which increase to larger ones in order to reduce communication cost (the number of interaction rounds with the aggregator). We prove a novel, tight convergence analysis for strongly convex problems which does not use the bounded-gradient assumption found in many existing publications. The tightness is a consequence of our proofs for lower and upper bounds of the convergence rate, which differ only by a constant factor. We show experimental results for plain convex and non-convex problems on both biased and unbiased local data sets.
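A minimal simulation of the round structure described above might look as follows; the linear batch-size schedule, the toy quadratic loss, and the synchronous averaging used as a stand-in for Hogwild!-style aggregation are all illustrative assumptions.

```python
import numpy as np

def local_sgd_round(w, X, Y, batch_size, local_steps, eta):
    """One round of local SGD on a client before reporting back to the aggregator."""
    for _ in range(local_steps):
        idx = np.random.choice(len(X), size=min(batch_size, len(X)), replace=False)
        g = np.mean([(w @ x - y) * x for x, y in zip(X[idx], Y[idx])], axis=0)
        w = w - eta * g
    return w

np.random.seed(2)
clients = [(np.random.randn(50, 3), np.random.randn(50)) for _ in range(4)]
w_global = np.zeros(3)
for r in range(1, 6):                                  # communication rounds
    batch_size = 4 * r                                 # linearly increasing mini-batch size
    updates = [local_sgd_round(w_global.copy(), X, Y, batch_size, local_steps=5, eta=0.05)
               for X, Y in clients]
    w_global = np.mean(updates, axis=0)                # simplified synchronous aggregation
```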
Asynchronous Federated Learning with Reduced Number of Rounds and with Differential Privacy from Less Aggregated Gaussian Noise
van Dijk, Marten, Nguyen, Nhuong V., Nguyen, Toan N., Nguyen, Lam M., Tran-Dinh, Quoc, Nguyen, Phuong Ha
The feasibility of federated learning is highly constrained by the server-client infrastructure in terms of network communication. Most newly launched smartphones and IoT devices are equipped with GPUs or sufficient computing hardware to run powerful AI models. However, with the original synchronous federated learning, client devices suffer waiting times, and regular communication between clients and the server is required. This implies greater sensitivity to local model training times and to irregular or missed updates; hence, scalability to large numbers of clients is limited and convergence rates measured in real time suffer. We propose a new algorithm for asynchronous federated learning which eliminates waiting times and reduces overall network communication; we provide rigorous theoretical analysis for strongly convex objective functions and provide simulation results. By adding Gaussian noise, we show how our algorithm can be made differentially private; new theorems show how the aggregated added Gaussian noise is significantly reduced.
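The sketch below simulates the asynchronous flavor of such a scheme: clients send noisy model differences whenever they finish, and the server applies them immediately. The clipping, noise placement, scheduling, and all constants here are illustrative assumptions rather than the paper's algorithm.

```python
import numpy as np
import random

def noisy_local_update(w, X, Y, steps, eta, sigma, C):
    """Client runs local SGD and adds Gaussian noise to its model difference."""
    w_local = w.copy()
    for _ in range(steps):
        i = np.random.randint(len(X))
        g = (w_local @ X[i] - Y[i]) * X[i]
        norm = np.linalg.norm(g)
        w_local -= eta * g / max(1.0, norm / C)           # clipped local gradient step
    diff = w_local - w
    return diff + sigma * C * np.random.randn(*w.shape)   # noise added once per round

np.random.seed(3)
clients = [(np.random.randn(40, 3), np.random.randn(40)) for _ in range(4)]
w_server = np.zeros(3)
# Asynchrony: updates arrive in an arbitrary client order and are applied immediately,
# possibly computed from a slightly stale server model.
for _ in range(20):
    c = random.randrange(len(clients))
    X, Y = clients[c]
    update = noisy_local_update(w_server, X, Y, steps=10, eta=0.05, sigma=0.5, C=1.0)
    w_server = w_server + update / len(clients)
```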
Beware the Black-Box: on the Robustness of Recent Defenses to Adversarial Examples
Mahmood, Kaleel, Gurevin, Deniz, van Dijk, Marten, Nguyen, Phuong Ha
Recent defenses published at venues like NIPS, ICML, ICLR and CVPR are mainly focused on mitigating white-box attacks. These defenses do not properly consider adaptive adversaries. In this paper, we expand the scope of these defenses to include adaptive black-box adversaries. Our evaluation is done on nine defenses including Barrage of Random Transforms, ComDefend, Ensemble Diversity, Feature Distillation, The Odds are Odd, Error Correcting Codes, Distribution Classifier Defense, K-Winner Take All and Buffer Zones. Our investigation is done using two black-box adversarial models and six widely studied adversarial attacks on the CIFAR-10 and Fashion-MNIST datasets. Our analyses show that most recent defenses provide only marginal improvements in security compared to undefended networks. Based on these results, we propose new standards for properly evaluating defenses against black-box adversaries. We provide this security framework to assist researchers in developing future black-box resistant models.
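The kind of robustness measurement underlying such an evaluation can be sketched generically; the toy model and the bounded-noise "attack" below are hypothetical placeholders standing in for a defended network and a real adaptive black-box attack.

```python
import numpy as np

def robust_accuracy(defended_model, attack, X, Y):
    """Fraction of inputs still classified correctly after the attack is applied."""
    X_adv = np.stack([attack(defended_model, x, y) for x, y in zip(X, Y)])
    preds = np.array([defended_model(x) for x in X_adv])
    return float(np.mean(preds == Y))

# Hypothetical stand-ins: a trivial "model" and a query-free bounded perturbation.
def toy_model(x):
    return int(x.sum() > 0)

def toy_blackbox_attack(model, x, y, eps=0.1):
    # perturbation confined to an L_inf ball of radius eps (illustrative only)
    return x + eps * np.sign(np.random.randn(*x.shape))

np.random.seed(4)
X = np.random.randn(100, 8)
Y = np.array([toy_model(x) for x in X])
print(robust_accuracy(toy_model, toy_blackbox_attack, X, Y))
```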
BUZz: BUffer Zones for defending adversarial examples in image classification
Nguyen, Phuong Ha, Mahmood, Kaleel, Nguyen, Lam M., Nguyen, Thanh, van Dijk, Marten
We propose a novel defense against all existing gradient-based adversarial attacks on deep neural networks for image classification problems. Our defense is based on a combination of deep neural networks and simple image transformations. While straightforward in implementation, this defense yields a unique security property which we term buffer zones. We argue that our defense based on buffer zones is secure against state-of-the-art black-box attacks. We are able to achieve this security even when the adversary has access to the entire ...
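Reading the abstract, a buffer-zone style decision rule can be sketched as an ensemble of networks, each behind a different fixed image transformation, that abstains (flags the input) on disagreement; the unanimity vote, the stand-in networks, and the transformations below are assumptions for illustration only, not the paper's exact construction.

```python
import numpy as np

def buffer_zone_predict(classifiers, transforms, x):
    """Vote over transformed copies; abstain (flag as adversarial) on disagreement.

    Each classifier sees the input through its own fixed image transformation.
    The region of inputs on which the members disagree acts as a 'buffer zone';
    the unanimity rule used here is an illustrative choice.
    """
    votes = [clf(t(x)) for clf, t in zip(classifiers, transforms)]
    return votes[0] if len(set(votes)) == 1 else "adversarial"

# Hypothetical stand-ins for trained networks and image transformations.
def net_a(img): return int(img.mean() > 0)
def net_b(img): return int(img.sum() > 0)

transforms = [lambda img: np.flip(img, axis=1),          # horizontal flip
              lambda img: np.clip(img + 0.05, -1, 1)]    # brightness shift

np.random.seed(5)
image = np.random.randn(8, 8)
print(buffer_zone_predict([net_a, net_b], transforms, image))
```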
DTN: A Learning Rate Scheme with Convergence Rate of $\mathcal{O}(1/t)$ for SGD
Nguyen, Lam M., Nguyen, Phuong Ha, Phan, Dzung T., Kalagnanam, Jayant R., van Dijk, Marten
We propose a novel diminishing learning rate scheme, coined Decreasing-Trend-Nature (DTN), which allows us to prove fast convergence of the Stochastic Gradient Descent (SGD) algorithm to a first-order stationary point for smooth general convex problems and for some classes of nonconvex problems, including neural network applications for classification. We are the first to prove that SGD with a diminishing learning rate achieves a convergence rate of $\mathcal{O}(1/t)$ for these problems. Our theory applies to neural network applications for classification problems in a straightforward way.
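Since the abstract does not spell out the DTN schedule, the sketch below shows plain SGD with a generic $1/t$-type diminishing stepsize as a stand-in, simply to fix the setting in which the $\mathcal{O}(1/t)$ rate is claimed; the schedule and constants are illustrative assumptions, not the DTN scheme itself.

```python
import numpy as np

def sgd_diminishing(X, Y, epochs, eta0=0.5, lam=0.1):
    """SGD with a diminishing stepsize eta_t = eta0 / (1 + lam * t).

    This 1/t-type schedule is a generic placeholder; the paper's DTN scheme
    has its own specific form tied to the decreasing trend of the objective.
    """
    w, t = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            eta_t = eta0 / (1 + lam * t)
            g = (w @ X[i] - Y[i]) * X[i]          # gradient of 0.5*(w.x - y)^2
            w, t = w - eta_t * g, t + 1
    return w

np.random.seed(6)
X = np.random.randn(200, 5)
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * np.random.randn(200)
print(sgd_diminishing(X, Y, epochs=5)[:3])
```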
Optimal Finite-Sum Smooth Non-Convex Optimization with SARAH
Nguyen, Lam M., van Dijk, Marten, Phan, Dzung T., Nguyen, Phuong Ha, Weng, Tsui-Wei, Kalagnanam, Jayant R.
The total complexity (measured as the total number of gradient computations) of a stochastic first-order optimization algorithm that finds a first-order stationary point of a finite-sum smooth nonconvex objective function $F(w)=\frac{1}{n} \sum_{i=1}^n f_i(w)$ has been proven to be at least $\Omega(\sqrt{n}/\epsilon)$, where $\epsilon$ denotes the attained accuracy $\mathbb{E}[ \|\nabla F(\tilde{w})\|^2] \leq \epsilon$ for the output approximation $\tilde{w}$ (Fang et al., 2018). This paper is the first to show that this lower bound is tight for the class of variance reduction methods which rely only on the Lipschitz continuous gradient assumption. We prove this complexity result for a slightly modified version of the SARAH algorithm in (Nguyen et al., 2017a;b), showing that SARAH is optimal and dominates all existing results. For convex optimization, we propose SARAH++ with sublinear convergence for general convex and linear convergence for strongly convex problems; and we provide a practical version for which numerical experiments on various datasets show improved performance.
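For readers unfamiliar with SARAH, its defining ingredient is the recursive gradient estimator $v_t = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(w_{t-1}) + v_{t-1}$, refreshed with a full gradient at the start of each outer loop. The NumPy sketch below applies it to a toy least-squares finite sum; the step size and loop lengths are illustrative, and this is not the paper's modified SARAH or SARAH++.

```python
import numpy as np

def grad_i(w, X, Y, i):
    """Gradient of the i-th component f_i(w) = 0.5*(w.x_i - y_i)^2 (toy finite sum)."""
    return (w @ X[i] - Y[i]) * X[i]

def sarah(X, Y, outer_iters, inner_iters, eta):
    """SARAH's recursive gradient estimator (variance-reduction sketch)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(outer_iters):
        v = np.mean([grad_i(w, X, Y, i) for i in range(n)], axis=0)   # full gradient
        w_prev, w = w, w - eta * v
        for _ in range(inner_iters):
            i = np.random.randint(n)
            v = grad_i(w, X, Y, i) - grad_i(w_prev, X, Y, i) + v      # recursive estimator
            w_prev, w = w, w - eta * v
    return w

np.random.seed(7)
X = np.random.randn(100, 4)
Y = X @ np.array([2.0, -1.0, 0.0, 1.5])
print(sarah(X, Y, outer_iters=5, inner_iters=50, eta=0.05))
```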