Goto

Collaborating Authors

 Gradient Descent


Stochasticity of Deterministic Gradient Descent: Large Learning Rate for Multiscale Objective Function

Neural Information Processing Systems

This article suggests that deterministic Gradient Descent, which does not use any stochastic gradient approximation, can still exhibit stochastic behaviors. In particular, it shows that if the objective function exhibit multiscale behaviors, then in a large learning rate regime which only resolves the macroscopic but not the microscopic details of the objective, the deterministic GD dynamics can become chaotic and convergent not to a local minimizer but to a statistical distribution. In this sense, deterministic GD resembles stochastic GD even though no stochasticity is injected. A sufficient condition is also established for approximating this long-time statistical limit by a rescaled Gibbs distribution, which for example allows escapes from local minima to be quantified. Both theoretical and numerical demonstrations are provided, and the theoretical part relies on the construction of a stochastic map that uses bounded noise (as opposed to Gaussian noise).


Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model

Neural Information Processing Systems

In the context of statistical supervised learning, the noiseless linear model assumes that there exists a deterministic linear relation Y \langle \theta_*, \Phi(U) \rangle between the random output Y and the random feature vector \Phi(U), a potentially non-linear transformation of the inputs U . We analyze the convergence of single-pass, fixed step-size stochastic gradient descent on the least-square risk under this model. The convergence of the iterates to the optimum \theta_* and the decay of the generalization error follow polynomial convergence rates with exponents that both depend on the regularities of the optimum \theta_* and of the feature vectors \Phi(U) . We interpret our result in the reproducing kernel Hilbert space framework. As a special case, we analyze an online algorithm for estimating a real function on the unit hypercube from the noiseless observation of its value at randomly sampled points; the convergence depends on the Sobolev smoothness of the function and of a chosen kernel.


Online Robust Regression via SGD on the l1 loss

Neural Information Processing Systems

We consider the robust linear regression problem in the online setting where we have access to the data in a streaming manner, one data point after the other. More specifically, for a true parameter \theta *, we consider the corrupted Gaussian linear model y \varepsilon b where the adversarial noise b can take any value with probability \eta and equals zero otherwise. We consider this adversary to be oblivious (i.e., b independent of the data) since this is the only contamination model under which consistency is possible. Current algorithms rely on having the whole data at hand in order to identify and remove the outliers. In contrast, we show in this work that stochastic gradient descent on the l1 loss converges to the true parameter vector at a \tilde{O}( 1 / (1 - \eta) 2 n) rate which is independent of the values of the contaminated measurements.


Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks

Neural Information Processing Systems

Natural gradient descent has proven very effective at mitigating the catastrophic effects of pathological curvature in the objective function, but little is known theoretically about its convergence properties, especially for \emph{non-linear} networks. In this work, we analyze for the first time the speed of convergence to global optimum for natural gradient descent on non-linear neural networks with the squared error loss. We identify two conditions which guarantee the global convergence: (1) the Jacobian matrix (of network's output for all training cases w.r.t the parameters) is full row rank and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks (i.e. with one hidden layer), we prove that these two conditions do hold throughout the training under the assumptions that the inputs do not degenerate and the network is over-parameterized. We further extend our analysis to more general loss function with similar convergence property.


Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Sparse Neural Networks

Neural Information Processing Systems

The lottery ticket hypothesis (LTH) states that learning on a properly pruned network (the winning ticket) has improved test accuracy over the original unpruned network. Although LTH has been justified empirically in a broad range of deep neural network (DNN) involved applications like computer vision and natural language processing, the theoretical validation of the improved generalization of a winning ticket remains elusive. To the best of our knowledge, our work, for the first time, characterizes the performance of training a pruned neural network by analyzing the geometric structure of the objective function and the sample complexity to achieve zero generalization error. We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned, indicating the structural importance of a winning ticket. Moreover, as the algorithm for training a pruned neural network is specified as an (accelerated) stochastic gradient descent algorithm, we theoretically show that the number of samples required for achieving zero generalization error is proportional to the number of the non-pruned weights in the hidden layer.


Explicit loss asymptotics in the gradient descent training of neural networks

Neural Information Processing Systems

Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values. In the present work we take a different approach and show that the learning trajectory of a wide network in a lazy training regime can be characterized by an explicit asymptotic at large training times. Specifically, the leading term in the asymptotic expansion of the loss behaves as a power law L(t) \sim C t {-\xi} with exponent \xi expressed only through the data dimension, the smoothness of the activation function, and the class of function being approximated. Our results are based on spectral analysis of the integral operator representing the linearized evolution of a large network trained on the expected loss. Importantly, the techniques we employ do not require a specific form of the data distribution, for example Gaussian, thus making our findings sufficiently universal.


Projected Stein Variational Gradient Descent

Neural Information Processing Systems

The curse of dimensionality is a longstanding challenge in Bayesian inference in high dimensions. In this work, we propose a {projected Stein variational gradient descent} (pSVGD) method to overcome this challenge by exploiting the fundamental property of intrinsic low dimensionality of the data informed subspace stemming from ill-posedness of such problems. We adaptively construct the subspace using a gradient information matrix of the log-likelihood, and apply pSVGD to the much lower-dimensional coefficients of the parameter projection. The method is demonstrated to be more accurate and efficient than SVGD. It is also shown to be more scalable with respect to the number of parameters, samples, data points, and processor cores via experiments with parameters dimensions ranging from the hundreds to the tens of thousands.


Solving Min-Max Optimization with Hidden Structure via Gradient Descent Ascent

Neural Information Processing Systems

Many recent AI architectures are inspired by zero-sum games, however, the behavior of their dynamics is still not well understood. Inspired by this, we study standard gradient descent ascent (GDA) dynamics in a specific class of non-convex non-concave zero-sum games, that we call hidden zero-sum games. In this class, players control the inputs of smooth but possibly non-linear functions whose outputs are being applied as inputs to a convex-concave game. Unlike general zero-sum games, these games have a well-defined notion of solution; outcomes that implement the von-Neumann equilibrium of the hidden" convex-concave game. We provide conditions under which vanilla GDA provably converges not merely to local Nash, but the actual von-Neumann solution.


Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model

Neural Information Processing Systems

As the model size grows rapidly, fine-tuning the large pre-trained language model has become increasingly difficult due to its extensive memory usage. Previous works usually focus on reducing the number of trainable parameters in the network. While the model parameters do contribute to memory usage, the primary memory bottleneck during training arises from storing feature maps, also known as activations, as they are crucial for gradient calculation. Notably, machine learning models are typically trained using stochastic gradient descent.We argue that in stochastic optimization, models can handle noisy gradients as long as the gradient estimator is unbiased with reasonable variance.Following this motivation, we propose a new family of unbiased estimators called \sas, for matrix production with reduced variance, which only requires storing the sub-sampled activations for calculating the gradient.Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.By replacing the linear operation with our approximated one in transformers, we can achieve up to 2.7X peak memory reduction with almost no accuracy drop and enables up to 6.4\times larger batch size.Under the same hardware, \sas enables better down-streaming task performance by applying larger models and/or faster training speed with larger batch sizes.The code is available at https://anonymous.4open.science/r/WTACRS-A5C5/.


Global Convergence of Gradient Descent for Deep Linear Residual Networks

Neural Information Processing Systems

We analyze the global convergence of gradient descent for deep linear residual networks by proposing a new initialization: zero-asymmetric (ZAS) initialization. It is motivated by avoiding stable manifolds of saddle points. We prove that under the ZAS initialization, for an arbitrary target matrix, gradient descent converges to an \varepsilon -optimal point in O\left( L 3 \log(1/\varepsilon) \right) iterations, which scales polynomially with the network depth L . Our result and the \exp(\Omega(L)) convergence time for the standard initialization (Xavier or near-identity) \cite{shamir2018exponential} together demonstrate the importance of the residual structure and the initialization in the optimization for deep linear neural networks, especially when L is large.