Goto

Collaborating Authors

 Gradient Descent


Neural Spectrum Alignment

arXiv.org Machine Learning

Expressiveness of deep models was recently addressed via the connection between neural networks (NNs) and kernel learning, where first-order dynamics of NN during a gradient-descent (GD) optimization were related to gradient similarity kernel, also known as Neural Tangent Kernel (NTK). In the majority of works this kernel is considered to be time-invariant, with its properties being defined entirely by NN architecture and independent of the learning task at hand. In contrast, in this paper we empirically explore these properties along the optimization and show that in practical applications the NN kernel changes in a very dramatic and meaningful way, with its top eigenfunctions aligning toward the target function learned by NN. Moreover, these top eigenfunctions serve sort of basis functions for NN output - a function represented by NN is spanned almost completely by them for the entire optimization process. Further, since the learning along top eigenfunctions is typically fast, their alignment with the target function improves the overall optimization performance. In addition, we study how the neural spectrum is affected by learning rate decay, typically done by practitioners, showing various trends in the kernel behavior. We argue that the presented phenomena may lead to a more complete theoretical understanding behind NN learning.


Learning Boolean Circuits with Neural Networks

arXiv.org Machine Learning

Training neural-networks is computationally hard. However, in practice they are trained efficiently using gradient-based algorithms, achieving remarkable performance on natural data. To bridge this gap, we observe the property of local correlation: correlation between small patterns of the input and the target label. We focus on learning deep neural-networks with a variant of gradient-descent, when the target function is a tree-structured Boolean circuit. We show that in this case, the existence of correlation between the gates of the circuit and the target label determines whether the optimization succeeds or fails. Using this result, we show that neural-networks can learn the (log n)-parity problem for most product distributions. These results hint that local correlation may play an important role in differentiating between distributions that are hard or easy to learn.


Bias-Variance Tradeoff in a Sliding Window Implementation of the Stochastic Gradient Algorithm

arXiv.org Machine Learning

This paper provides a framework to analyze stochastic gradient algorithms in a mean squared error (MSE) sense using the asymptotic normality result of the stochastic gradient descent (SGD) iterates. We perform this analysis by taking the asymptotic normality result and applying it to the finite iteration case. Specifically, we look at problems where the gradient estimators are biased and have reduced variance and compare the iterates generated by these gradient estimators to the iterates generated by the SGD algorithm. We use the work of Fabian to characterize the mean and the variance of the distribution of the iterates in terms of the bias and the covariance matrix of the gradient estimators. We introduce the sliding window SGD (SW-SGD) algorithm, with its proof of convergence, which incurs a lower MSE than the SGD algorithm on quadratic and convex problems. Lastly, we present some numerical results to show the effectiveness of this framework and the superiority of SW-SGD algorithm over the SGD algorithm.


Non-Gaussianity of Stochastic Gradient Noise

arXiv.org Machine Learning

What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in Neural Network training? This question has attracted much attention. In this paper, we study the distribution of the Stochastic Gradient Noise (SGN) vectors during the training. We observe that for batch sizes 256 and above, the distribution is best described as Gaussian at-least in the early phases of training. This holds across data-sets, architectures, and other choices.


Autoencoding with XCSF

arXiv.org Artificial Intelligence

Autoencoders enable data dimensionality reduction and are a key component of many (deep) learning systems. This article explores the use of the XCSF online evolutionary reinforcement learning system to perform autoencoding. Initial results using a neural network representation and combining artificial evolution with stochastic gradient descent, suggest it is an effective approach to data reduction. The approach adaptively subdivides the input domain into local approximations that are simpler than a global neural network solution. By allowing the number of neurons in the autoencoders to evolve, this further enables the emergence of an ensemble of structurally heterogeneous solutions to cover the problem space. In this case, networks of differing complexity are typically seen to cover different areas of the problem space. Furthermore, the rate of gradient descent applied to each layer is tuned via self-adaptive mutation, thereby reducing the parameter optimisation task.


r/MachineLearning - [D] Machine Learning - WAYR (What Are You Reading) - Week 72

#artificialintelligence

I've been idly wondering lately about the problem of identifying high value samples to obtain for improving models, which seems to get at something similar under uncertainty. It isn't necessarily going to be economical to do an exhaustive sampling of whatever you're interested in, but collecting a few strategic datapoints could be relatively affordable and help a lot with inference. I also was wondering if some kind of hypothesis falsification module could be stapled onto gradient descent algorithms somehow. In terms of simulated annealing, because that's mentally easier for me, the idea would be that we want the temperature of nonlocal jumps to be hotter when the machine is making failed guesses, and we want it to be cooler when the gradient is behaving like the falsification module expects. The motivation for this is just that for inference, a lot of the time it is easier to learn things if you go out of your way to test your assumptions. Just having those assumptions be consistent with your observations is only a weak test of their value.


Faster Stochastic Algorithms via History-Gradient Aided Batch Size Adaptation

arXiv.org Machine Learning

Various schemes for adapting batch size have been recently proposed to accelerate stochastic algorithms. However, existing schemes either apply prescribed batch size adaption or require additional backtracking and condition verification steps to exploit the information along optimization path. In this paper, we propose an easy-to-implement scheme for adapting batch size by exploiting history stochastic gradients, based on which we propose the Adaptive batch size SGD (AbaSGD), AbaSVRG, and AbaSPIDER algorithms. To handle the dependence of the batch size on history stochastic gradients, we develop a new convergence analysis technique, and show that these algorithms achieve improved overall complexity over their vanilla counterparts. Moreover, their convergence rates are adaptive to the optimization landscape that the iterate experiences. Extensive experiments demonstrate that our algorithms substantially outperform existing competitive algorithms.


Collapsed Amortized Variational Inference for Switching Nonlinear Dynamical Systems

arXiv.org Machine Learning

We propose an efficient inference method for switching nonlinear dynamical systems. The key idea is to learn an inference network which can be used as a proposal distribution for the continuous latent variables, while performing exact marginalization of the discrete latent variables. This allows us to use the reparameterization trick, and apply end-to-end training with stochastic gradient descent. We show that the proposed method can successfully segment time series data (including videos) into meaningful "regimes", by using the piece-wise nonlinear dynamics.


Adaptive gradient descent without descent

arXiv.org Machine Learning

Yura Malitsky Konstantin Mishchenko † Abstract We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) don't increase the stepsize too fast and 2) don't overstep the local curvature. By following these rules, you get a method adaptive to the local geometry, with convergence guarantees depending only on smoothness in a neighborhood of a solution. Given that the problem is convex, our method will converge even if the global smoothness constant is infinity. As an illustration, it can minimize arbitrary continuously twice-differentiable convex function. We examine its performance on a range of convex and nonconvex problems, including matrix factorization and training of ResNet-18. 1 Introduction Since the early days of optimization it was evident that there is a need for algorithms that are as independent from the user as possible. First-order methods have proven to be versatile and efficient in a wide range of applications, but one drawback has been present all that time: the stepsize. Despite some certain success stories, line search procedures and adaptive online methods have not removed the need to manually tune the optimization parameters. Even in smooth convex optimization, which is often believed to be much simpler than the nonconvex counterpart, robust rules for stepsize selection have been elusive. The purpose of this work is to remedy this deficiency. The problem formulation that we consider is the basic unconstrained optimization problem min x R d f (x), (1) where f: R d R is a differentiable function. Throughout the paper we assume that (1) has a solution and we denote its optimal value by f . The simplest and most known approach to this problem is the gradient descent method (GD), whose origin can be traced back to Cauchy [7,20]. Although it is probably the oldest optimization method, it continues to play a central role in modern algorithmic theory and applications. Its definition can be written in a mere one line, x k 1 x k λ f (x k), k 0, (2) where x 0 R d is arbitrary and λ 0 .


Sparsification as a Remedy for Staleness in Distributed Asynchronous SGD

arXiv.org Machine Learning

Large scale machine learning is increasingly relying on distributed optimization, whereby several machines contribute to the training process of a statistical model. While there exist a large literature on stochastic gradient descent (SGD) and variants, the study of countermeasures to mitigate problems arising in asynchronous distributed settings are still in their infancy. The key question of this work is whether sparsification, a technique predominantly used to reduce communication overheads, can also mitigate the staleness problem that affects asynchronous SGD. We study the role of sparsification both theoretically and empirically. Our theory indicates that, in an asynchronous, non-convex setting, the ergodic convergence rate of sparsified SGD matches the known result $\mathcal{O} \left( 1/\sqrt{T} \right)$ of non-convex SGD. We then carry out an empirical study to complement our theory and show that, in practice, sparsification consistently improves over vanilla SGD and current alternatives to mitigate the effects of staleness.