Goto

Collaborating Authors

 Gradient Descent


Generalisation under gradient descent via deterministic PAC-Bayes

arXiv.org Artificial Intelligence

We establish disintegrated PAC-Bayesian generalisation bounds for models trained with gradient descent methods or continuous gradient flows. Contrary to standard practice in the PAC-Bayesian setting, our result applies to optimisation algorithms that are deterministic, without requiring any de-randomisation step. Our bounds are fully computable, depending on the density of the initial distribution and the Hessian of the training objective over the trajectory. We show that our framework can be applied to a variety of iterative optimisation algorithms, including stochastic gradient descent (SGD), momentum-based schemes, and damped Hamiltonian dynamics.


CGDTest: A Constrained Gradient Descent Algorithm for Testing Neural Networks

arXiv.org Artificial Intelligence

In this paper, we propose a new Deep Neural Network (DNN) testing algorithm called the Constrained Gradient Descent (CGD) method, and an implementation we call CGDTest aimed at exposing security and robustness issues such as adversarial robustness and bias in DNNs. Our CGD algorithm is a gradient-descent (GD) method, with the twist that the user can also specify logical properties that characterize the kinds of inputs that the user may want. This functionality sets CGDTest apart from other similar DNN testing tools since it allows users to specify logical constraints to test DNNs not only for $\ell_p$ ball-based adversarial robustness but, more importantly, includes richer properties such as disguised and flow adversarial constraints, as well as adversarial robustness in the NLP domain. We showcase the utility and power of CGDTest via extensive experimentation in the context of vision and NLP domains, comparing against 32 state-of-the-art methods over these diverse domains. Our results indicate that CGDTest outperforms state-of-the-art testing tools for $\ell_p$ ball-based adversarial robustness, and is significantly superior in testing for other adversarial robustness, with improvements in PAR2 scores of over 1500% in some cases over the next best tool. Our evaluation shows that our CGD method outperforms competing methods we compared against in terms of expressibility (i.e., a rich constraint language and concomitant tool support to express a wide variety of properties), scalability (i.e., can be applied to very large real-world models with up to 138 million parameters), and generality (i.e., can be used to test a plethora of model architectures).


Asynchronous Iterations in Optimization: New Sequence Results and Sharper Algorithmic Guarantees

arXiv.org Artificial Intelligence

We introduce novel convergence results for asynchronous iterations that appear in the analysis of parallel and distributed optimization algorithms. The results are simple to apply and give explicit estimates for how the degree of asynchrony impacts the convergence rates of the iterates. Our results shorten, streamline and strengthen existing convergence proofs for several asynchronous optimization methods and allow us to establish convergence guarantees for popular algorithms that were thus far lacking a complete theoretical understanding. Specifically, we use our results to derive better iteration complexity bounds for proximal incremental aggregated gradient methods, to obtain tighter guarantees depending on the average rather than maximum delay for the asynchronous stochastic gradient descent method, to provide less conservative analyses of the speedup conditions for asynchronous block-coordinate implementations of Krasnoselskii-Mann iterations, and to quantify the convergence rates for totally asynchronous iterations under various assumptions on communication delays and update rates.


Evolving Artificial Neural Networks To Imitate Human Behaviour In Shinobi III : Return of the Ninja Master

arXiv.org Artificial Intelligence

Our society is increasingly fond of computational tools. This phenomenon has greatly increased over the past decade following, among other factors, the emergence of a new Artificial Intelligence paradigm. Specifically, the coupling of two algorithmic techniques, Deep Neural Networks and Stochastic Gradient Descent, thrusted by an exponentially increasing computing capacity, has and is continuing to become a major asset in many modern technologies. However, as progress takes its course, some still wonder whether other methods could similarly or even more greatly benefit from these various hardware advances. In order to further this study, we delve in this thesis into Evolutionary Algorithms and their application to Dynamic Neural Networks, two techniques which despite enjoying many advantageous properties have yet to find their niche in contemporary Artificial Intelligence. We find that by elaborating new methods while exploiting strong computational resources, it becomes possible to develop strongly performing agents on a variety of benchmarks but also some other agents behaving very similarly to human subjects on the video game Shinobi III : Return of The Ninja Master, typical complex tasks previously out of reach for non-gradient-based optimization.


Learning by Grouping: A Multilevel Optimization Framework for Improving Fairness in Classification without Losing Accuracy

arXiv.org Artificial Intelligence

The integration of machine learning models in various real-world applications is becoming more prevalent to assist humans in their daily decision-making tasks as a result of recent advancements in this field. However, it has been discovered that there is a tradeoff between the accuracy and fairness of these decision-making tasks. In some cases, these AI systems can be unfair by exhibiting bias or discrimination against certain social groups, which can have severe consequences in real life. Inspired by one of the most well-known human learning skills called grouping, we address this issue by proposing a novel machine learning framework where the ML model learns to group a diverse set of problems into distinct subgroups to solve each subgroup using its specific sub-model. Our proposed framework involves three stages of learning, which are formulated as a three-level optimization problem: (i) learning to group problems into different subgroups; (ii) learning group-specific sub-models for problem-solving; and (iii) updating group assignments of training examples by minimizing the validation loss. These three learning stages are performed end-to-end in a joint manner using gradient descent. To improve fairness and accuracy, we develop an efficient optimization algorithm to solve this three-level optimization problem. To further reduce the risk of overfitting in small datasets, we incorporate domain adaptation techniques in the second stage of training. We further apply our method to neural architecture search. Extensive experiments on various datasets demonstrate our method's effectiveness and performance improvements in both fairness and accuracy. Our proposed Learning by Grouping can reduce overfitting and achieve state-of-the-art performances with fixed human-designed network architectures and searchable network architectures on various datasets.


Fast Convergence of Random Reshuffling under Over-Parameterization and the Polyak-\L ojasiewicz Condition

arXiv.org Artificial Intelligence

Modern machine learning models are often over-parameterized and as a result they can interpolate the training data. Under such a scenario, we study the convergence properties of a sampling-without-replacement variant of stochastic gradient descent (SGD) known as random reshuffling (RR). Unlike SGD that samples data with replacement at every iteration, RR chooses a random permutation of data at the beginning of each epoch and each iteration chooses the next sample from the permutation. For under-parameterized models, it has been shown RR can converge faster than SGD under certain assumptions. However, previous works do not show that RR outperforms SGD in over-parameterized settings except in some highly-restrictive scenarios. For the class of Polyak-\L ojasiewicz (PL) functions, we show that RR can outperform SGD in over-parameterized settings when either one of the following holds: (i) the number of samples ($n$) is less than the product of the condition number ($\kappa$) and the parameter ($\alpha$) of a weak growth condition (WGC), or (ii) $n$ is less than the parameter ($\rho$) of a strong growth condition (SGC).


High-dimensional scaling limits and fluctuations of online least-squares SGD with smooth covariance

arXiv.org Machine Learning

We derive high-dimensional scaling limits and fluctuations for the online least-squares Stochastic Gradient Descent (SGD) algorithm by taking the properties of the data generating model explicitly into consideration. Our approach treats the SGD iterates as an interacting particle system, where the expected interaction is characterized by the covariance structure of the input. Assuming smoothness conditions on moments of order up to eight orders, and without explicitly assuming Gaussianity, we establish the high-dimensional scaling limits and fluctuations in the form of infinite-dimensional Ordinary Differential Equations (ODEs) or Stochastic Differential Equations (SDEs). Our results reveal a precise three-step phase transition of the iterates; it goes from being ballistic, to diffusive, and finally to purely random behavior, as the noise variance goes from low, to moderate and finally to very-high noise setting. In the low-noise setting, we further characterize the precise fluctuations of the (scaled) iterates as infinite-dimensional SDEs. We also show the existence and uniqueness of solutions to the derived limiting ODEs and SDEs. Our results have several applications, including characterization of the limiting mean-square estimation or prediction errors and their fluctuations which can be obtained by analytically or numerically solving the limiting equations.


Doubly Stochastic Models: Learning with Unbiased Label Noises and Inference Stability

arXiv.org Artificial Intelligence

Random label noises (or observational noises) widely exist in practical machine learning settings. While previous studies primarily focus on the affects of label noises to the performance of learning, our work intends to investigate the implicit regularization effects of the label noises, under mini-batch sampling settings of stochastic gradient descent (SGD), with assumptions that label noises are unbiased. Specifically, we analyze the learning dynamics of SGD over the quadratic loss with unbiased label noises, where we model the dynamics of SGD as a stochastic differentiable equation (SDE) with two diffusion terms (namely a Doubly Stochastic Model). While the first diffusion term is caused by mini-batch sampling over the (label-noiseless) loss gradients as many other works on SGD, our model investigates the second noise term of SGD dynamics, which is caused by mini-batch sampling over the label noises, as an implicit regularizer. Our theoretical analysis finds such implicit regularizer would favor some convergence points that could stabilize model outputs against perturbation of parameters (namely inference stability). Though similar phenomenon have been investigated, our work doesn't assume SGD as an Ornstein-Uhlenbeck like process and achieve a more generalizable result with convergence of approximation proved. To validate our analysis, we design two sets of empirical studies to analyze the implicit regularizer of SGD with unbiased random label noises for deep neural networks training and linear regression.


Federated Learning Using Variance Reduced Stochastic Gradient for Probabilistically Activated Agents

arXiv.org Artificial Intelligence

This paper proposes an algorithm for Federated Learning (FL) with a two-layer structure that achieves both variance reduction and a faster convergence rate to an optimal solution in the setting where each agent has an arbitrary probability of selection in each iteration. In distributed machine learning, when privacy matters, FL is a functional tool. Placing FL in an environment where it has some irregular connections of agents (devices), reaching a trained model in both an economical and quick way can be a demanding job. The first layer of our algorithm corresponds to the model parameter propagation across agents done by the server. In the second layer, each agent does its local update with a stochastic and variance-reduced technique called Stochastic Variance Reduced Gradient (SVRG). We leverage the concept of variance reduction from stochastic optimization when the agents want to do their local update step to reduce the variance caused by stochastic gradient descent (SGD). We provide a convergence bound for our algorithm which improves the rate from $O(\frac{1}{\sqrt{K}})$ to $O(\frac{1}{K})$ by using a constant step-size. We demonstrate the performance of our algorithm using numerical examples.


Conflict-Averse Gradient Optimization of Ensembles for Effective Offline Model-Based Optimization

arXiv.org Artificial Intelligence

Data-driven offline model-based optimization (MBO) is an established practical approach to black-box computational design problems for which the true objective function is unknown and expensive to query. However, the standard approach which optimizes designs against a learned proxy model of the ground truth objective can suffer from distributional shift. Specifically, in high-dimensional design spaces where valid designs lie on a narrow manifold, the standard approach is susceptible to producing out-of-distribution, invalid designs that "fool" the learned proxy model into outputting a high value. Using an ensemble rather than a single model as the learned proxy can help mitigate distribution shift, but naive formulations for combining gradient information from the ensemble, such as minimum or mean gradient, are still suboptimal and often hampered by non-convergent behavior. In this work, we explore alternate approaches for combining gradient information from the ensemble that are robust to distribution shift without compromising optimality of the produced designs. More specifically, we explore two functions, formulated as convex optimization problems, for combining gradient information: multiple gradient descent algorithm (MGDA) and conflict-averse gradient descent (CAGrad). We evaluate these algorithms on a diverse set of five computational design tasks. We compare performance of ensemble MBO with MGDA and ensemble MBO with CAGrad with three naive baseline algorithms: (a) standard single-model MBO, (b) ensemble MBO with mean gradient, and (c) ensemble MBO with minimum gradient. Our results suggest that MGDA and CAGrad strike a desirable balance between conservatism and optimality and can help robustify data-driven offline MBO without compromising optimality of designs.