Gradient Descent
Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality
Tang, Kejie, Liu, Weidong, Zhang, Yichen
SGD is a first-order optimization algorithm that approximates the expected loss by averaging the loss function over a mini-batch of training examples. At each iteration, the algorithm updates the model parameters in the direction of the negative gradient of the mini-batch loss, scaled by a learning rate parameter. While SGD is simple and easy to implement, it may suffer from slow convergence rates or oscillations in high-dimensional optimization problems, particularly when the loss function is illconditioned or noisy. Momentum-based methods enhance SGD by introducing an exponentially weighted moving average of the past gradients to the update rule, which serves to dampen oscillations and accelerate convergence. In particular, the momentum term introduces a form of inertia to the update process, allowing the algorithm to maintain a more consistent direction of movement even in the presence of noisy gradients. Several variants of momentum-based SGD have been proposed, such as Nesterov's accelerated gradient (NAG), Adagrad, and Adam, each with its own strengths and weaknesses.
Which Features are Learnt by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression
Xue, Yihao, Joshi, Siddharth, Gan, Eric, Chen, Pin-Yu, Mirzasoleiman, Baharan
Contrastive learning (CL) has emerged as a powerful technique for representation learning, with or without label supervision. However, supervised CL is prone to collapsing representations of subclasses within a class by not capturing all their features, and unsupervised CL may suppress harder class-relevant features by focusing on learning easy class-irrelevant features; both significantly compromise representation quality. Yet, there is no theoretical understanding of \textit{class collapse} or \textit{feature suppression} at \textit{test} time. We provide the first unified theoretically rigorous framework to determine \textit{which} features are learnt by CL. Our analysis indicate that, perhaps surprisingly, bias of (stochastic) gradient descent towards finding simpler solutions is a key factor in collapsing subclass representations and suppressing harder class-relevant features. Moreover, we present increasing embedding dimensionality and improving the quality of data augmentations as two theoretically motivated solutions to {feature suppression}. We also provide the first theoretical explanation for why employing supervised and unsupervised CL together yields higher-quality representations, even when using commonly-used stochastic gradient methods.
SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates
Frangella, Zachary, Rathore, Pratik, Zhao, Shipu, Udell, Madeleine
SketchySGD improves upon existing stochastic gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian and by introducing an automated stepsize that works well across a wide range of convex machine learning problems. We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum. Further, in the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems. We validate this improvement empirically with ridge regression experiments on real data. Numerical experiments on both ridge and logistic regression problems with dense and sparse data, show that SketchySGD equipped with its default hyperparameters can achieve comparable or better results than popular stochastic gradient methods, even when they have been tuned to yield their best performance. In particular, SketchySGD is able to solve an ill-conditioned logistic regression problem with a data matrix that takes more than $840$GB RAM to store, while its competitors, even when tuned, are unable to make any progress. SketchySGD's ability to work out-of-the box with its default hyperparameters and excel on ill-conditioned problems is an advantage over other stochastic gradient methods, most of which require careful hyperparameter tuning (especially of the learning rate) to obtain good performance and degrade in the presence of ill-conditioning.
Faster Margin Maximization Rates for Generic Optimization Methods
Wang, Guanghui, Hu, Zihao, Muthukumar, Vidya, Abernethy, Jacob
First-order optimization methods tend to inherently favor certain solutions over others when minimizing a given training objective with multiple local optima. This phenomenon, known as implicit bias, plays a critical role in understanding the generalization capabilities of optimization algorithms. Recent research has revealed that gradient-descent-based methods exhibit an implicit bias for the $\ell_2$-maximal margin classifier in the context of separable binary classification. In contrast, generic optimization methods, such as mirror descent and steepest descent, have been shown to converge to maximal margin classifiers defined by alternative geometries. However, while gradient-descent-based algorithms demonstrate fast implicit bias rates, the implicit bias rates of generic optimization methods have been relatively slow. To address this limitation, in this paper, we present a series of state-of-the-art implicit bias rates for mirror descent and steepest descent algorithms. Our primary technique involves transforming a generic optimization algorithm into an online learning dynamic that solves a regularized bilinear game, providing a unified framework for analyzing the implicit bias of various optimization methods. The accelerated rates are derived leveraging the regret bounds of online learning algorithms within this game framework.
Fast Offline Policy Optimization for Large Scale Recommendation
Sakhi, Otmane, Rohde, David, Gilotte, Alexandre
Personalised interactive systems such as recommender systems require selecting relevant items from massive catalogs dependent on context. Reward-driven offline optimisation of these systems can be achieved by a relaxation of the discrete problem resulting in policy learning or REINFORCE style learning algorithms. Unfortunately, this relaxation step requires computing a sum over the entire catalogue making the complexity of the evaluation of the gradient (and hence each stochastic gradient descent iterations) linear in the catalogue size. This calculation is untenable in many real world examples such as large catalogue recommender systems, severely limiting the usefulness of this method in practice. In this paper, we derive an approximation of these policy learning algorithms that scale logarithmically with the catalogue size. Our contribution is based upon combining three novel ideas: a new Monte Carlo estimate of the gradient of a policy, the self normalised importance sampling estimator and the use of fast maximum inner product search at training time. Extensive experiments show that our algorithm is an order of magnitude faster than naive approaches yet produces equally good policies.
Fast and Minimax Optimal Estimation of Low-Rank Matrices via Non-Convex Gradient Descent
Zhang, Gavin, Chiu, Hong-Ming, Zhang, Richard Y.
We study the problem of estimating a low-rank matrix from noisy measurements, with the specific goal of achieving minimax optimal error. In practice, the problem is commonly solved using non-convex gradient descent, due to its ability to scale to large-scale real-world datasets. In theory, non-convex gradient descent is capable of achieving minimax error. But in practice, it often converges extremely slowly, such that it cannot even deliver estimations of modest accuracy within reasonable time. On the other hand, methods that improve the convergence of non-convex gradient descent, through rescaling or preconditioning, also greatly amplify the measurement noise, resulting in estimations that are orders of magnitude less accurate than what is theoretically achievable with minimax optimal error. In this paper, we propose a slight modification to the usual non-convex gradient descent method that remedies the issue of slow convergence, while provably preserving its minimax optimality. Our proposed algorithm has essentially the same per-iteration cost as non-convex gradient descent, but is guaranteed to converge to minimax error at a linear rate that is immune to ill-conditioning. Using our proposed algorithm, we reconstruct a 60 megapixel dataset for a medical imaging application, and observe significantly decreased reconstruction error compared to previous approaches.
Stochastic Optimization under Distributional Drift
Cutler, Joshua, Drusvyatskiy, Dmitriy, Harchaoui, Zaid
We consider the problem of minimizing a convex function that is evolving according to unknown and possibly stochastic dynamics, which may depend jointly on time and on the decision variable itself. Such problems abound in the machine learning and signal processing literature, under the names of concept drift, stochastic tracking, and performative prediction. We provide novel non-asymptotic convergence guarantees for stochastic algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability. The efficiency estimates we obtain clearly decouple the contributions of optimization error, gradient noise, and time drift. Notably, we identify a low drift-to-noise regime in which the tracking efficiency of the proximal stochastic gradient method benefits significantly from a step decay schedule. Numerical experiments illustrate our results.
Neural networks trained with SGD learn distributions of increasing complexity
Refinetti, Maria, Ingrosso, Alessandro, Goldt, Sebastian
The ability of deep neural networks to generalise well even when they interpolate their training data has been explained using various "simplicity biases". These theories postulate that neural networks avoid overfitting by first learning simple functions, say a linear classifier, before learning more complex, non-linear functions. Meanwhile, data structure is also recognised as a key ingredient for good generalisation, yet its role in simplicity biases is not yet understood. Here, we show that neural networks trained using stochastic gradient descent initially classify their inputs using lower-order input statistics, like mean and covariance, and exploit higher-order statistics only later during training. We first demonstrate this distributional simplicity bias (DSB) in a solvable model of a neural network trained on synthetic data. We empirically demonstrate DSB in a range of deep convolutional networks and visual transformers trained on CIFAR10, and show that it even holds in networks pre-trained on ImageNet. We discuss the relation of DSB to other simplicity biases and consider its implications for the principle of Gaussian universality in learning.
Equalization in Dispersion-Managed Systems Using Learned Digital Back-Propagation
Abu-Romoh, Mohannad, Costa, Nelson, Jaouën, Yves, Napoli, Antonio, Pedro, João, Spinnler, Bernhard, Yousefi, Mansoor
In this paper, we investigate the use of the learned digital back-propagation (LDBP) for equalizing dual-polarization fiber-optic transmission in dispersion-managed (DM) links. LDBP is a deep neural network that optimizes the parameters of DBP using the stochastic gradient descent. We evaluate DBP and LDBP in a simulated WDM dual-polarization fiber transmission system operating at the bitrate of 256 Gbit/s per channel, with a dispersion map designed for a 2016 km link with 15% residual dispersion. Our results show that in single-channel transmission, LDBP achieves an effective signal-to-noise ratio improvement of 6.3 dB and 2.5 dB, respectively, over linear equalization and DBP. In WDM transmission, the corresponding $Q$-factor gains are 1.1 dB and 0.4 dB, respectively. Additionally, we conduct a complexity analysis, which reveals that a frequency-domain implementation of LDBP and DBP is more favorable in terms of complexity than the time-domain implementation. These findings demonstrate the effectiveness of LDBP in mitigating the nonlinear effects in DM fiber-optic transmission systems.
Modulate Your Spectrum in Self-Supervised Learning
Weng, Xi, Ni, Yunhao, Song, Tengwei, Luo, Jie, Anwer, Rao Muhammad, Khan, Salman, Khan, Fahad Shahbaz, Huang, Lei
Whitening loss provides theoretical guarantee in avoiding feature collapse for self-supervised learning (SSL) using joint embedding architectures. One typical implementation of whitening loss is hard whitening that designs whitening transformation over embedding and imposes the loss on the whitened output. In this paper, we propose spectral transformation (ST) framework to map the spectrum of embedding to a desired distribution during forward pass, and to modulate the spectrum of embedding by implicit gradient update during backward pass. We show that whitening transformation is a special instance of ST by definition, and there exist other instances that can avoid collapse by our empirical investigation. Furthermore, we propose a new instance of ST, called IterNorm with trace loss (INTL). We theoretically prove that INTL can avoid collapse and modulate the spectrum of embedding towards an equal-eigenvalue distribution during the course of optimization. Moreover, INTL achieves 76.6% top-1 accuracy in linear evaluation on ImageNet using ResNet-50, which exceeds the performance of the supervised baseline, and this result is obtained by using a batch size of only 256. Comprehensive experiments show that INTL is a promising SSL method in practice. The code is available at https://github.com/winci-ai/intl.