Goto

Collaborating Authors

 Gradient Descent


r/MachineLearning - [R]: Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates

#artificialintelligence

The authors use a classic Armijo line-search approach in the context of SGD to automatically tune the line search parameter in training the neural networks. They're also able to prove convergence results on minimizing convex and non-convex objective functions satisfying certain growth conditions. An aside, but as an optimization-head myself, it's nice to see some of the traditional optimization ideas make their way into an ML context.


Dynamic Network Embedding via Incremental Skip-gram with Negative Sampling

arXiv.org Machine Learning

Network representation learning, as an approach to learn low dimensional representations of vertices, has attracted considerable research attention recently. It has been proven extremely useful in many machine learning tasks over large graph. Most existing methods focus on learning the structural representations of vertices in a static network, but cannot guarantee an accurate and efficient embedding in a dynamic network scenario. To address this issue, we present an efficient incremental skip-gram algorithm with negative sampling for dynamic network embedding, and provide a set of theoretical analyses to characterize the performance guarantee. Specifically, we first partition a dynamic network into the updated, including addition/deletion of links and vertices, and the retained networks over time. Then we factorize the objective function of network embedding into the added, vanished and retained parts of the network. Next we provide a new stochastic gradient-based method, guided by the partitions of the network, to update the nodes and the parameter vectors. The proposed algorithm is proven to yield an objective function value with a bounded difference to that of the original objective function. Experimental results show that our proposal can significantly reduce the training time while preserving the comparable performance. We also demonstrate the correctness of the theoretical analysis and the practical usefulness of the dynamic network embedding. We perform extensive experiments on multiple real-world large network datasets over multi-label classification and link prediction tasks to evaluate the effectiveness and efficiency of the proposed framework, and up to 22 times speedup has been achieved.


Refined Generalization Analysis of Gradient Descent for Over-parameterized Two-layer Neural Networks with Smooth Activations on Classification Problems

arXiv.org Machine Learning

Recently, several studies have proven the global convergence and generalization abilities of the gradient descent method for two-layer ReLU networks by making a positivity assumption of the Gram-matrix of the neural tangent kernel. However, the performance of gradient descent on classification problems has not been well studied, and further investigation of the problem structure is possible. In this work, we present a partially stronger but reasonable assumption for binary classification problems compared to the positivity assumption of the Gram-matrix, where a data distribution can be perfectly classifiable by a tangent model, and we provide a refined generalization analysis of the gradient descent method for two-layer networks with smooth activations. A remarkable point of this study is that our generalization bound has much better dependence on the network width compared to existing results. As a result, our theory significantly enlarges a class of over-parameterized networks having provable generalization ability, with respect to network width, while most studies require much higher over-parameterization.


Non-Differentiable Supervised Learning with Evolution Strategies and Hybrid Methods

arXiv.org Machine Learning

In this work we show that Evolution Strategies (ES) are a viable method for learning non-differentiable parameters of large supervised models. ES are black-box optimization algorithms that estimate distributions of model parameters; however they have only been used for relatively small problems so far. We show that it is possible to scale ES to more complex tasks and models with millions of parameters. While using ES for differentiable parameters is computationally impractical (although possible), we show that a hybrid approach is practically feasible in the case where the model has both differentiable and non-differentiable parameters. In this approach we use standard gradient-based methods for learning differentiable weights, while using ES for learning non-differentiable parameters - in our case sparsity masks of the weights. This proposed method is surprisingly competitive, and when parallelized over multiple devices has only negligible training time overhead compared to training with gradient descent. Additionally, this method allows to train sparse models from the first training step, so they can be much larger than when using methods that require training dense models first.


Training Dynamics of Deep Networks using Stochastic Gradient Descent via Neural Tangent Kernel

arXiv.org Machine Learning

Stochastic Gradient Descent (SGD) is widely used to train deep neural networks. However, few theoretical results on the training dynamics of SGD are available. Recent work by Jacot et al. (2018) has showed that training a neural network of any kind with a full batch gradient descent in parameter space is equivalent to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK). Lee et al. (2019) built on this result to show that the output of a neural network trained using full batch gradient descent can be approximated by a linear model for wide neural networks. We show here how these results can be extended to SGD. In this case, the resulting training dynamics is given by a stochastic differential equation dependent on the NTK which becomes a simple mean-reverting process for the squared loss. When the network depth is also large, we provide a comprehensive analysis on the impact of the initialization and the activation function on the NTK, and thus on the corresponding training dynamics under SGD. We provide experiments illustrating our theoretical results.


Online Forecasting of Total-Variation-bounded Sequences

arXiv.org Machine Learning

We consider the problem of online forecasting of sequences of length $n$ with total-variation at most $C_n$ using observations contaminated by independent $\sigma$-subgaussian noise. We design an $O(n\log n)$-time algorithm that achieves a cumulative square error of $\tilde{O}(n^{1/3}C_n^{2/3}\sigma^{4/3})$ with high probability. The result is rate-optimal as it matches the known minimax rate for the offline nonparametric estimation of the same class [Mammen and van de Geer, 1997]. To the best of our knowledge, this is the first \emph{polynomial-time} algorithm that optimally forecasts total variation bounded sequences. Our proof techniques leverage the special localized structure of Haar wavelet basis and adaptivity to unknown smoothness parameter in the classical wavelet smoothing [Donoho et al., 1998]. We also compare our model to the rich literature of dynamic regret minimization and nonstationary stochastic optimization, where our problem can be treated as a special case. We show that the workhorse in those settings --- online gradient descent and its variants with a fixed restarting schedule --- are instances of a class of \emph{linear forecasters} that require a suboptimal regret of $\tilde{\Omega}(\sqrt{n})$. This implies that the use of more adaptive algorithms are necessary to obtain the optimal rate.


Efficient Project Gradient Descent for Ensemble Adversarial Attack

arXiv.org Machine Learning

Recent advances show that deep neural networks are not robust to deliberately crafted adversarial examples which many are generated by adding human imperceptible perturbation to clear input. Consider $l_2$ norms attacks, Project Gradient Descent (PGD) and the Carlini and Wagner (C\&W) attacks are the two main methods, where PGD control max perturbation for adversarial examples while C\&W approach treats perturbation as a regularization term optimized it with loss function together. If we carefully set parameters for any individual input, both methods become similar. In general, PGD attacks perform faster but obtains larger perturbation to find adversarial examples than the C\&W when fixing the parameters for all inputs. In this report, we propose an efficient modified PGD method for attacking ensemble models by automatically changing ensemble weights and step size per iteration per input. This method generates smaller perturbation adversarial examples than PGD method while remains efficient as compared to C\&W method. Our method won the first place in IJCAI19 Targeted Adversarial Attack competition.


Computing Exact Guarantees for Differential Privacy

arXiv.org Machine Learning

Quantification of the privacy loss associated with a randomised algorithm has become an active area of research and $(\varepsilon,\delta)$-differential privacy has arisen as the standard measure of it. We propose a numerical method for evaluating the parameters of differential privacy for algorithms with continuous one dimensional output. In this way the parameters $\varepsilon$ and $\delta$ can be evaluated, for example, for the subsampled multidimensional Gaussian mechanism which is also the underlying mechanism of differentially private stochastic gradient descent. The proposed method is based on a numerical approximation of an integral formula which gives the exact $(\varepsilon,\delta)$-values. The approximation is carried out by discretising the integral and by evaluating discrete convolutions using a fast Fourier transform algorithm. We give theoretical error bounds which show the convergence of the approximation and guarantee its accuracy to an arbitrary degree. Experimental comparisons with state-of-the-art techniques illustrate the efficacy of the method. Python code for the proposed method can be found in Github (https://github.com/DPBayes/PLD-Accountant/).


Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-family Approximations

arXiv.org Machine Learning

Natural-gradient methods enable fast and simple algorithms for variational inference, but due to computational difficulties, their use is mostly limited to \emph{minimal} exponential-family (EF) approximations. In this paper, we extend their application to estimate \emph{structured} approximations such as mixtures of EF distributions. Such approximations can fit complex, multimodal posterior distributions and are generally more accurate than unimodal EF approximations. By using a \emph{minimal conditional-EF} representation of such approximations, we derive simple natural-gradient updates. Our empirical results demonstrate a faster convergence of our natural-gradient method compared to black-box gradient-based methods. Our work expands the scope of natural gradients for Bayesian inference and makes them more widely applicable than before.


Bad Global Minima Exist and SGD Can Reach Them

arXiv.org Machine Learning

Several recent works have aimed to explain why severely overparameterized models, generalize well when trained by Stochastic Gradient Descent (SGD). The emergent consensus explanation has two parts: the first is that there are "no bad local minima", while the second is that SGD performs implicit regularization by having a bias towards low complexity models. We revisit both of these ideas in the context of image classification with common deep neural network architectures. Our first finding is that there exist bad global minima, i.e., models that fit the training set perfectly, yet have poor generalization. Our second finding is that given only unlabeled training data, we can easily construct initializations that will cause SGD to quickly converge to such bad global minima. For example, on CIFAR, CINIC10, and (Restricted) ImageNet, this can be achieved by starting SGD at a model derived by fitting random labels on the training data: while subsequent SGD training (with the correct labels) will reach zero training error, the resulting model will exhibit a test accuracy degradation of up to 40% compared to training from a random initialization. Finally, we show that regularization seems to provide SGD with an escape route: once heuristics such as data augmentation are used, starting from a complex model (adversarial initialization) has no effect on the test accuracy.