Gradient Descent


Byzantine-Resilient SGD in High Dimensions on Heterogeneous Data

arXiv.org Machine Learning

We study distributed stochastic gradient descent (SGD) in the master-worker architecture under Byzantine attacks. We consider the heterogeneous data model, where different workers may have different local datasets, and we do not make any probabilistic assumptions on data generation. At the core of our algorithm, we use the polynomial-time outlier-filtering procedure for robust mean estimation proposed by Steinhardt et al. (ITCS 2018) to filter out corrupt gradients. In order to apply their filtering procedure in our heterogeneous data setting, where workers compute stochastic gradients, we derive a new matrix concentration result, which may be of independent interest. We provide convergence analyses for smooth strongly-convex and non-convex objectives. We derive our results under the bounded variance assumption on local stochastic gradients and a deterministic condition on datasets, namely, gradient dissimilarity; for both these quantities, we provide concrete bounds in the statistical heterogeneous data model. We give a trade-off between the mini-batch size for stochastic gradients and the approximation error. Our algorithm can tolerate up to a $\frac{1}{4}$ fraction of Byzantine workers. It can find approximate optimal parameters in the strongly-convex setting exponentially fast and reach an approximate stationary point in the non-convex setting with linear speed, thus matching the convergence rates of vanilla SGD in the Byzantine-free setting. We also propose and analyze a Byzantine-resilient SGD algorithm with gradient compression, where workers send $k$ random coordinates of their gradients. Under mild conditions, we show a $\frac{d}{k}$-factor saving in communication bits as well as decoding complexity over our compression-free algorithm without affecting its convergence rate (order-wise) and the approximation error.
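The compression step mentioned at the end of this abstract, in which each worker sends only $k$ random coordinates of its $d$-dimensional gradient, can be illustrated with a minimal sketch. The function names, the toy gradient, and the $d/k$ rescaling (which makes the sparse vector an unbiased estimate) are assumptions made for illustration; the abstract does not specify these details, and the Byzantine filtering step is not shown.

```python
import random

def compress_random_k(grad, k, rng):
    """Keep k coordinates chosen uniformly without replacement.

    Scaling by d/k (an assumption here) makes the sparse vector an
    unbiased estimate of the full gradient.
    """
    d = len(grad)
    kept = rng.sample(range(d), k)
    return {i: grad[i] * d / k for i in kept}

def decompress(sparse, d):
    """Rebuild a dense length-d vector, with zeros at dropped coordinates."""
    dense = [0.0] * d
    for i, value in sparse.items():
        dense[i] = value
    return dense

rng = random.Random(0)
grad = [float(i) for i in range(1, 9)]      # toy gradient, d = 8
sparse = compress_random_k(grad, k=2, rng=rng)
dense = decompress(sparse, len(grad))
print(len(sparse), "of", len(grad), "coordinates sent")
```

With $d = 8$ and $k = 2$ the worker transmits a quarter of the coordinates, which is the $\frac{d}{k}$-factor saving in its simplest form.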


Using Simulated Annealing to Declutter Genome Visualizations

AAAI Conferences

AccuSyn is an interactive browser that visualizes conserved synteny relations (similar features) in genomes, giving biologists insights into the evolutionary history and functional relationships between genes. Even simple organisms have huge numbers of genomic features, and raw synteny plots present a daunting clutter of connections to make sense of. Using a mixed-initiative approach, AccuSyn integrates simulated annealing, a well-known metaheuristic for optimization problems, with human interventions to offer non-experts a way to automate decluttering, eliminating a tedious manual bottleneck in the discovery of syntenic information. AccuSyn has since been deployed online to a worldwide user community


Learn Gradient Descent (with code)

#artificialintelligence

Lots of statistics and machine learning involves turning a bunch of data into new numbers to make good decisions. For example, a data scientist might use your past bids on a Google search term, and the results, to work out the expected return on investment (ROI) for new bids. Armed with this knowledge you can make an informed decision about how much to bid in the future. Cool, but what if those ROIs are wrong? Luckily, data scientists don't just guess these numbers! They use data to generate a reasoned number for them.


Efficient Federated Learning over Multiple Access Channel with Differential Privacy Constraints

arXiv.org Machine Learning

In this paper, the problem of federated learning (FL) over a multiple access channel (MAC) is considered. More precisely, we consider the FL setting in which clients are prompted to train a machine learning model by simultaneous communications with a parameter server (PS) with the aim of better utilizing the computational resources available in the network. We also consider the additional constraint in which the communication between the users and the PS is subject to a privacy constraint. To minimize the training loss while also satisfying the privacy rate constraint over the MAC channel, the distributed transmission of digital variants of stochastic gradient descents (D-DSGD) is performed by each client. Additionally, binomial noise is also added at each user to preserve the privacy of the transmission. The optimum levels of quantization in the D-DSGD and the binary noise parameters to achieve efficiency in terms of convergence are investigated, subject to privacy constraint and capacity limit of the MAC channel.


An easy guide to Gradient Descent - AnalyticsWeek

#artificialintelligence

What is Gradient Descent? Gradient Descent is an iterative process that finds the minima of a function. It is an optimisation algorithm that finds the parameters or coefficients at which a function attains its minimum value. The post An easy guide to Gradient Descent appeared first on GreatLearning.
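As a minimal sketch of the iterative process described here, the loop below repeatedly steps against the gradient of a one-dimensional function. The example function $f(x) = (x - 3)^2$, the learning rate, and the step count are assumptions chosen for illustration, not values taken from the guide.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly move x a small step in the direction opposite the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# f(x) = (x - 3)**2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

Each step shrinks the distance to the minimum by a constant factor, so the iterate ends up very close to 3.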


5 Best Courses to Learn Mathematics for Machine Learning

#artificialintelligence

So you want to learn the Mathematics for Machine Learning? Well, for Machine Learning or Deep Learning and AI, a thorough mathematical understanding is not optional. I know the options out there, the prerequisites, and the skills you need to become successful in Machine Learning and AI. If you want to learn Machine Learning, these classes will help you to master the mathematical foundation required for writing programs and algorithms for Machine Learning, Deep Learning and AI. My goal in this piece is to help you find the resources to gain good intuition and get you the hands-on experience you need with coding neural nets, stochastic gradient descent, and principal component analysis.


Convergence of Online Adaptive and Recurrent Optimization Algorithms

arXiv.org Machine Learning

We prove local convergence of several notable gradient descent algorithms used in machine learning, for which standard stochastic gradient descent theory does not apply. This includes, first, online algorithms for recurrent models and dynamical systems, such as Real-time recurrent learning (RTRL) and its computationally lighter approximations NoBackTrack and UORO; second, several adaptive algorithms such as RMSProp, online natural gradient, and Adam with $\beta^2\to 1$. Despite local convergence being a relatively weak requirement for a new optimization algorithm, no local analysis was available for these algorithms, as far as we knew. Analysis of these algorithms does not immediately follow from standard stochastic gradient (SGD) theory. In fact, Adam has been proved to lack local convergence in some simple situations. For recurrent models, online algorithms modify the parameter while the model is running, which further complicates the analysis with respect to simple SGD. Local convergence for these various algorithms results from a single, more general set of assumptions, in the setup of learning dynamical systems online. Thus, these results can cover other variants of the algorithms considered. We adopt an "ergodic" rather than probabilistic viewpoint, working with empirical time averages instead of probability distributions. This is more data-agnostic and creates differences with respect to standard SGD theory, especially for the range of possible learning rates. For instance, with cycling or per-epoch reshuffling over a finite dataset instead of pure i.i.d. sampling with replacement, empirical averages of gradients converge at rate $1/T$ instead of $1/\sqrt{T}$ (cycling acts as a variance reduction method), theoretically allowing for larger learning rates than in SGD.
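The closing remark about cycling versus i.i.d. sampling can be checked on a toy example. The sketch below treats a small list of numbers as stand-ins for per-sample gradients (an assumption for illustration) and compares the error of the running average under the two sampling schemes. Under cycling, every full pass contributes exactly the dataset mean, so only the last partial pass can cause error, which gives the $1/T$ behaviour.

```python
import random

data = [1.0, 4.0, 2.0, 8.0, 5.0]          # toy per-sample "gradients"
n = len(data)
mean = sum(data) / n
spread = max(abs(x - mean) for x in data)

def cycling_error(T):
    """Error of the running average after T steps of cycling through data."""
    return abs(sum(data[t % n] for t in range(T)) / T - mean)

def iid_error(T, rng):
    """Error of the running average after T i.i.d. draws with replacement."""
    return abs(sum(rng.choice(data) for _ in range(T)) / T - mean)

rng = random.Random(0)
for T in (13, 103, 1003):
    print(T, cycling_error(T), iid_error(T, rng))
```

The cycling error is bounded by `(n - 1) * spread / T`, while the i.i.d. error fluctuates from run to run and is typically on the order of $1/\sqrt{T}$.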


FedSplit: An algorithmic framework for fast federated optimization

arXiv.org Machine Learning

Federated learning is a rapidly evolving application of distributed optimization for estimation and learning problems in large-scale networks of remote clients [13]. These systems present new challenges, as they are characterized by heterogeneity in computational resources and data across the network, unreliable communication, massive scale, and privacy constraints [16]. A typical application is for developers of cell phones and cellular applications to model the usage of software and devices across millions or even billions of users. Distributed optimization has a rich history and extensive literature (e.g., see the sources [2, 5, 8, 31, 15, 24] and references therein), and federated learning has led to a flurry of interest in the area. A number of different procedures have been proposed for federated learning and related problems, using methods based on stochastic gradient methods or proximal procedures. Notably, McMahan et al. [18] introduced the FedSGD and FedAvg algorithms, which both adapt the classical stochastic gradient method to the federated setting, considering the possibility that clients may fail and may only be subsampled on each round of computation. Another recent proposal has been to use regularized local problems to mitigate possible issues that arise with device heterogeneity and failures [17]. These authors propose the FedProx procedure, an algorithm that applies averaged proximal updates to solve federated minimization problems. Currently, the convergence theory and correctness of these methods are lacking, and practitioners have documented failures of convergence in certain settings (e.g., see Figure 3 and related discussion in the work [18]).
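The FedAvg pattern referred to above (clients train locally from the current server model, then the server averages the results) can be sketched on a toy problem. Everything concrete here is an assumption for illustration: each client holds a one-dimensional quadratic loss $\frac{1}{2}(x - c_i)^2$ with a hypothetical optimum $c_i$, all clients participate in every round, and local training uses full gradients rather than stochastic ones.

```python
def local_sgd(x, center, lr, steps):
    """Run `steps` gradient steps on the local loss 0.5 * (x - center)**2."""
    for _ in range(steps):
        x -= lr * (x - center)
    return x

def fedavg(centers, x0=0.0, lr=0.1, local_steps=5, rounds=50):
    """Each round: every client starts from the server model, trains locally,
    and the server averages the returned models."""
    x = x0
    for _ in range(rounds):
        client_models = [local_sgd(x, c, lr, local_steps) for c in centers]
        x = sum(client_models) / len(client_models)
    return x

centers = [1.0, 2.0, 6.0]                  # hypothetical client optima
x_final = fedavg(centers)
print(x_final)
```

In this toy setting all local losses share the same curvature, so the averaged model settles at the mean of the client optima, which is also the minimizer of the summed loss. The sketch says nothing about behaviour under client subsampling or more heterogeneous losses.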


AdaX: Adaptive Gradient Descent with Exponential Long Term Memory

arXiv.org Machine Learning

Although adaptive optimization algorithms such as Adam show fast convergence in many machine learning tasks, this paper identifies a problem with Adam by analyzing its performance on a simple non-convex synthetic problem, showing that Adam's fast convergence may lead the algorithm to local minima. To address this problem, we improve Adam by proposing a novel adaptive gradient descent algorithm named AdaX. Unlike Adam, which ignores past gradients, AdaX exponentially accumulates long-term gradient information during training to adaptively tune the learning rate. We thoroughly prove the convergence of AdaX in both the convex and non-convex settings. Extensive experiments show that AdaX outperforms Adam in various tasks of computer vision and natural language processing and can catch up with Stochastic Gradient Descent.


High-Dimensional Robust Mean Estimation via Gradient Descent

arXiv.org Machine Learning

We study the problem of high-dimensional robust mean estimation in the presence of a constant fraction of adversarial outliers. A recent line of work has provided sophisticated polynomial-time algorithms for this problem with dimension-independent error guarantees for a range of natural distribution families. In this work, we show that a natural non-convex formulation of the problem can be solved directly by gradient descent. Our approach leverages a novel structural lemma, roughly showing that any approximate stationary point of our non-convex objective gives a near-optimal solution to the underlying robust estimation task. Our work establishes an intriguing connection between algorithmic high-dimensional robust statistics and non-convex optimization, which may have broader applications to other robust estimation tasks.