sharpness measure
Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold
Despite the popularity of the Adam optimizer in practice, most theoretical analyses study SGD as a proxy, and little is known about how the solutions found by Adam differ. In this paper, we show that Adam reduces a specific form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. When the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize via a continuous-time approximation using stochastic differential equations. We further illustrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, $\text{tr}(\textbf{H})$, whereas we prove that Adam minimizes $\text{tr}(\text{diag}(\textbf{H})^{1/2})$ instead. In solving sparse linear regression with diagonal linear networks, Adam provably achieves better sparsity and generalization than SGD due to this difference. Finally, we note that our proof framework applies not only to Adam but also to a broad class of adaptive gradient methods, including but not limited to RMSProp, Adam-mini, and Adalayer. This provides a unified perspective for analyzing how adaptive optimizers reduce sharpness and may offer insights for future optimizer design.
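To make the two sharpness measures named in this abstract concrete, here is a minimal Python sketch that evaluates $\text{tr}(\textbf{H})$ and $\text{tr}(\text{diag}(\textbf{H})^{1/2})$ on two small matrices. The helper names (`trace`, `trace_sqrt_diag`) and the example matrices are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

def trace(H):
    """tr(H): sum of the diagonal entries of a square matrix given as nested lists."""
    return sum(H[i][i] for i in range(len(H)))

def trace_sqrt_diag(H):
    """tr(diag(H)^{1/2}): sum of the square roots of the diagonal entries.

    Assumes the diagonal entries are non-negative (e.g. H is positive semidefinite).
    """
    return sum(math.sqrt(H[i][i]) for i in range(len(H)))

# Two hypothetical Hessians with the same trace but different diagonal profiles.
H_spread = [[2.0, 0.0], [0.0, 2.0]]        # curvature spread evenly
H_concentrated = [[4.0, 0.0], [0.0, 0.0]]  # curvature concentrated on one coordinate

print(trace(H_spread), trace(H_concentrated))  # 4.0 4.0
print(trace_sqrt_diag(H_spread))               # 2 * sqrt(2), about 2.83
print(trace_sqrt_diag(H_concentrated))         # 2.0
```

In this toy case the trace cannot distinguish the two matrices, while the square-root measure is smaller for the concentrated diagonal. That is at least consistent with the sparsity claim in the abstract, though the example says nothing about the optimizer dynamics themselves.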
Appendix
This appendix is structured as follows: In Appendix A we provide more training details. In particular, we report the hyperparameters used for the CIFAR experiments in A.1 and for the ImageNet experiments in A.2. In A.3 we provide more details and a formal definition of the SAM-variants used throughout this paper. In Appendix B we show additional experimental results for: CIFAR in B.1, ImageNet in B.3, and a machine translation task in B.5. In B.2 we provide additional ablation studies for sparse perturbation SSAM approaches and in B.4 we extend the discussion on adversarial robustness.
Sharpness-Aware Training for Free
Modern deep neural networks (DNNs) have achieved state-of-the-art performance but are typically over-parameterized. This over-parameterization may result in undesirably large generalization error in the absence of other customized training strategies. Recently, a line of research under the name of Sharpness-Aware Minimization (SAM) has shown that minimizing a sharpness measure, which reflects the geometry of the loss landscape, can significantly reduce the generalization error. However, SAM-like methods incur a two-fold computational overhead of the given base optimizer (e.g.
Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization
Zhang, Qiaozhe, Sun, Jun, Zhang, Ruijie, Liu, Yingzhuang
Sharpness (of the loss minima) is a common measure for investigating the generalization of neural networks. Intuitively, the flatter the landscape near the minima, the better the generalization might be. Unfortunately, the correlation between many existing sharpness measures and generalization is usually not strong, and sometimes even weak. To close the gap between intuition and reality, we propose a novel sharpness measure, \textit{Rényi sharpness}, which is defined as the negative Rényi entropy (a generalization of the classical Shannon entropy) of the loss Hessian. The main ideas are as follows: 1) we observe that \textit{uniform} (identical) eigenvalues of the loss Hessian are most desirable (while keeping the sum constant) for achieving good generalization; 2) we employ the \textit{Rényi entropy} to concisely characterize the extent of the spread of the eigenvalues of the loss Hessian. Normally, the larger the spread, the smaller the (Rényi) entropy. To rigorously establish the relationship between generalization and (Rényi) sharpness, we provide several generalization bounds in terms of Rényi sharpness, by taking advantage of the reparametrization invariance property of Rényi sharpness, as well as the trick of translating the data discrepancy to the weight perturbation. Furthermore, extensive experiments are conducted to verify the strong correlation (specifically, Kendall rank correlation) between Rényi sharpness and generalization. Moreover, we propose to use a variant of Rényi sharpness as a regularizer during training, i.e., Rényi Sharpness Aware Minimization (RSAM), which turns out to outperform all existing sharpness-aware minimization methods. It is worth noting that the test accuracy gain of our proposed RSAM method can be as high as nearly 2.5\% compared with the classical SAM method.
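The relationship between eigenvalue spread and Rényi entropy described above can be sketched in a few lines of Python. This sketch assumes the standard Rényi entropy of order $\alpha$, $H_\alpha(p) = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$, applied to Hessian eigenvalues normalized to sum to one. The abstract does not state the paper's exact normalization or choice of $\alpha$, so both are assumptions here, as are the function names and the example spectra.

```python
import math

def renyi_entropy(eigs, alpha=2.0):
    """Rényi entropy of order alpha for eigenvalues normalized to a distribution.

    Assumes non-negative eigenvalues with a positive sum, and alpha > 0.
    """
    total = sum(eigs)
    p = [e / total for e in eigs]
    if alpha == 1.0:
        # The alpha -> 1 limit is the Shannon entropy; zero entries contribute nothing.
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

def renyi_sharpness(eigs, alpha=2.0):
    """Negative Rényi entropy, following the verbal definition in the abstract."""
    return -renyi_entropy(eigs, alpha)

# Two hypothetical spectra with the same sum (4.0).
uniform = [1.0, 1.0, 1.0, 1.0]
spread = [3.7, 0.1, 0.1, 0.1]

print(renyi_entropy(uniform))  # log(4), about 1.386
print(renyi_sharpness(uniform) < renyi_sharpness(spread))  # True
```

With the sum held fixed, the uniform spectrum attains the maximum entropy and therefore the lowest sharpness under this definition, which matches the abstract's statement that a larger spread gives a smaller entropy.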
Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It
da Silva, Marvin F., Dangel, Felix, Oore, Sageev
The concept of sharpness has been successfully applied to traditional architectures like MLPs and CNNs to predict their generalization. For transformers, however, recent work has reported weak correlation between flatness and generalization. We argue that existing sharpness measures fail for transformers because they have much richer symmetries in their attention mechanism, which induce directions in parameter space along which the network or its loss remains identical. We posit that sharpness must account fully for these symmetries, and thus we redefine it on a quotient manifold that results from quotienting out the transformer symmetries, thereby removing their ambiguities. Leveraging tools from Riemannian geometry, we propose a fully general notion of sharpness in terms of a geodesic ball on the symmetry-corrected quotient manifold. In practice, we need to resort to approximating the geodesics. Doing so up to first order yields existing adaptive sharpness measures, and we demonstrate that including higher-order terms is crucial to recover correlation with generalization. We present results on diagonal networks with synthetic data, and show that our geodesic sharpness reveals strong correlation for real-world transformers on both text and image classification tasks.
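The claim that parameter symmetries can distort a naive sharpness measure can be illustrated on a toy model. The sketch below uses a hypothetical two-parameter "diagonal network" $f(x) = u \cdot v \cdot x$ with squared loss, and shows that the rescaling $(u, v) \mapsto (\alpha u, v/\alpha)$ leaves predictions unchanged while changing the Hessian trace. This is only a simplified analogue: it implements neither the transformer attention symmetries nor the geodesic sharpness proposed in the paper.

```python
def predict(u, v, x):
    """Toy two-parameter diagonal network: f(x) = u * v * x."""
    return u * v * x

def hessian_trace(u, v, x):
    """Trace of the Hessian of L = 0.5 * (u*v*x - y)^2 with respect to (u, v).

    d^2L/du^2 = (v*x)^2 and d^2L/dv^2 = (u*x)^2, so the trace does not depend on y.
    """
    return (v * x) ** 2 + (u * x) ** 2

x = 1.0
u, v = 1.0, 1.0
alpha = 2.0  # rescaling factor of the symmetry (u, v) -> (alpha * u, v / alpha)
u_scaled, v_scaled = alpha * u, v / alpha

print(predict(u, v, x), predict(u_scaled, v_scaled, x))  # 1.0 1.0
print(hessian_trace(u, v, x))                            # 2.0
print(hessian_trace(u_scaled, v_scaled, x))              # 4.25
```

Both parameter settings represent the same function, yet the trace-based sharpness differs by more than a factor of two. This is the kind of ambiguity that motivates defining sharpness on a space where such symmetries are quotiented out.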