Goto

Collaborating Authors

 Gradient Descent


Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees

arXiv.org Machine Learning

Gradient clipping is a popular modification to standard (stochastic) gradient descent, at every iteration limiting the gradient norm to a certain value $c >0$. It is widely used for example for stabilizing the training of deep learning models (Goodfellow et al., 2016), or for enforcing differential privacy (Abadi et al., 2016). Despite popularity and simplicity of the clipping mechanism, its convergence guarantees often require specific values of $c$ and strong noise assumptions. In this paper, we give convergence guarantees that show precise dependence on arbitrary clipping thresholds $c$ and show that our guarantees are tight with both deterministic and stochastic gradients. In particular, we show that (i) for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence, (ii) in the stochastic setting convergence to the true optimum cannot be guaranteed under the standard noise assumption, even under arbitrary small step-sizes. We give matching upper and lower bounds for convergence of the gradient norm when running clipped SGD, and illustrate these results with experiments.


Dataset Distillation with Convexified Implicit Gradients

arXiv.org Machine Learning

We propose a new dataset distillation algorithm using reparameterization and convexification of implicit gradients (RCIG), that substantially improves the state-of-the-art. To this end, we first formulate dataset distillation as a bi-level optimization problem. Then, we show how implicit gradients can be effectively used to compute meta-gradient updates. We further equip the algorithm with a convexified approximation that corresponds to learning on top of a frozen finite-width neural tangent kernel. Finally, we improve bias in implicit gradients by parameterizing the neural network to enable analytical computation of final-layer parameters given the body parameters. RCIG establishes the new state-of-the-art on a diverse series of dataset distillation tasks. Notably, with one image per class, on resized ImageNet, RCIG sees on average a 108\% improvement over the previous state-of-the-art distillation algorithm. Similarly, we observed a 66\% gain over SOTA on Tiny-ImageNet and 37\% on CIFAR-100.


Training shallow ReLU networks on noisy data using hinge loss: when do we overfit and is it benign?

arXiv.org Artificial Intelligence

We study benign overfitting in two-layer ReLU networks trained using gradient descent and hinge loss on noisy data for binary classification. In particular, we consider linearly separable data for which a relatively small proportion of labels are corrupted or flipped. We identify conditions on the margin of the clean data that give rise to three distinct training outcomes: benign overfitting, in which zero loss is achieved and with high probability test data is classified correctly; non-benign overfitting, in which zero loss is achieved but test data is misclassified with probability lower bounded by a constant; and non-overfitting, in which clean points, but not corrupt points, achieve zero loss and again with high probability test data is classified correctly. Our analysis provides a fine-grained description of the dynamics of neurons throughout training and reveals two distinct phases: in the first phase clean points achieve close to zero loss, in the second phase clean points oscillate on the boundary of zero loss while corrupt points either converge towards zero loss or are eventually zeroed by the network. We prove these results using a combinatorial approach that involves bounding the number of clean versus corrupt updates across these phases of training.


Double Averaging and Gradient Projection: Convergence Guarantees for Decentralized Constrained Optimization

arXiv.org Artificial Intelligence

We consider a generic decentralized constrained optimization problem over static, directed communication networks, where each agent has exclusive access to only one convex, differentiable, local objective term and one convex constraint set. For this setup, we propose a novel decentralized algorithm, called DAGP (Double Averaging and Gradient Projection), based on local gradients, projection onto local constraints, and local averaging. We achieve global optimality through a novel distributed tracking technique we call distributed null projection. Further, we show that DAGP can be used to solve unconstrained problems with non-differentiable objective terms with a problem reduction scheme. Assuming only smoothness of the objective terms, we study the convergence of DAGP and establish sub-linear rates of convergence in terms of feasibility, consensus, and optimality, with no extra assumption (e.g. strong convexity). For the analysis, we forego the difficulties of selecting Lyapunov functions by proposing a new methodology of convergence analysis in optimization problems, which we refer to as aggregate lower-bounding. To demonstrate the generality of this method, we also provide an alternative convergence proof for the standard gradient descent algorithm with smooth functions. Finally, we present numerical results demonstrating the effectiveness of our proposed method in both constrained and unconstrained problems. In particular, we propose a distributed scheme by DAGP for the optimal transport problem with superior performance and speed.


How to Scale Your EMA

arXiv.org Machine Learning

Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6$\times$ wall-clock time reduction under idealized hardware settings.


Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems

arXiv.org Artificial Intelligence

We consider stochastic optimization problems with heavy-tailed noise with structured density. For such problems, we show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-2(\alpha - 1)/\alpha})$, when the stochastic gradients have finite moments of order $\alpha \in (1, 2]$. In particular, our analysis allows the noise norm to have an unbounded expectation. To achieve these results, we stabilize stochastic gradients, using smoothed medians of means. We prove that the resulting estimates have negligible bias and controllable variance. This allows us to carefully incorporate them into clipped-SGD and clipped-SSTM and derive new high-probability complexity bounds in the considered setup.


Loss Dynamics of Temporal Difference Reinforcement Learning

arXiv.org Machine Learning

Reinforcement learning has been successful across several applications in which agents have to learn to act in environments with sparse feedback. However, despite this empirical success there is still a lack of theoretical understanding of how the parameters of reinforcement learning models and the features used to represent states interact to control the dynamics of learning. In this work, we use concepts from statistical physics, to study the typical case learning curves for temporal difference learning of a value function with linear function approximators. Our theory is derived under a Gaussian equivalence hypothesis where averages over the random trajectories are replaced with temporally correlated Gaussian feature averages and we validate our assumptions on small scale Markov Decision Processes. We find that the stochastic semi-gradient noise due to subsampling the space of possible episodes leads to significant plateaus in the value error, unlike in traditional gradient descent dynamics. We study how learning dynamics and plateaus depend on feature structure, learning rate, discount factor, and reward function. We then analyze how strategies like learning rate annealing and reward shaping can favorably alter learning dynamics and plateaus. To conclude, our work introduces new tools to open a new direction towards developing a theory of learning dynamics in reinforcement learning.


Spectral Evolution and Invariance in Linear-width Neural Networks

arXiv.org Machine Learning

We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to network width. Empirically, we show that the spectra of weight in this high dimensional regime are invariant when trained by gradient descent for small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy tail behavior. Simple examples are provided to explain when heavy tails can have better generalizations. We exhibit different spectral properties such as invariant bulk, spike, and heavy-tailed distribution from a two-layer neural network using different training strategies, and then correlate them to the feature learning. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning.


Autonomous Learning of Generative Models with Chemical Reaction Network Ensembles

arXiv.org Artificial Intelligence

Can a micron sized sack of interacting molecules autonomously learn an internal model of a complex and fluctuating environment? We draw insights from control theory, machine learning theory, chemical reaction network theory, and statistical physics to develop a general architecture whereby a broad class of chemical systems can autonomously learn complex distributions. Our construction takes the form of a chemical implementation of machine learning's optimization workhorse: gradient descent on the relative entropy cost function. We show how this method can be applied to optimize any detailed balanced chemical reaction network and that the construction is capable of using hidden units to learn complex distributions. This result is then recast as a form of integral feedback control. Finally, due to our use of an explicit physical model of learning, we are able to derive thermodynamic costs and trade-offs associated to this process.


Parameter-Agnostic Optimization under Relaxed Smoothness

arXiv.org Machine Learning

Tuning hyperparameters, such as the stepsize, presents a major challenge of training machine learning models. To address this challenge, numerous adaptive optimization algorithms have been developed that achieve near-optimal complexities, even when stepsizes are independent of problem-specific parameters, provided that the loss function is $L$-smooth. However, as the assumption is relaxed to the more realistic $(L_0, L_1)$-smoothness, all existing convergence results still necessitate tuning of the stepsize. In this study, we demonstrate that Normalized Stochastic Gradient Descent with Momentum (NSGD-M) can achieve a (nearly) rate-optimal complexity without prior knowledge of any problem parameter, though this comes at the cost of introducing an exponential term dependent on $L_1$ in the complexity. We further establish that this exponential term is inevitable to such schemes by introducing a theoretical framework of lower bounds tailored explicitly for parameter-agnostic algorithms. Interestingly, in deterministic settings, the exponential factor can be neutralized by employing Gradient Descent with a Backtracking Line Search. To the best of our knowledge, these findings represent the first parameter-agnostic convergence results under the generalized smoothness condition. Our empirical experiments further confirm our theoretical insights.