Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding

Noam Levi, Alon Beck, Yohai Bar-Sinai

arXiv.org Machine Learning 

Understanding the underlying correlations in complex datasets is the main challenge of statistical learning. Assuming that training and generalization data are drawn from a similar distribution, the discrepancy between training and generalization metrics quantifies how well a model extracts meaningful features from the training data, and what portion of its predictions relies on idiosyncrasies of the training set. Traditionally, one would expect that once training of a neural network (NN) converges to a low loss value, the generalization error should either plateau, for good models, or deteriorate, for models that overfit. Surprisingly, [18] found that a shallow transformer trained on algorithmic datasets exhibits drastically different dynamics. The network first overfits the training data, achieving low and stable training loss with high generalization error for an extended period, then suddenly and rapidly transitions to a perfect generalization phase. This counter-intuitive phenomenon, dubbed grokking, has recently garnered much attention, and many underlying mechanisms have been proposed as possible explanations. These include the difficulty of representation learning [10], the scale of parameters at initialization [11], spikes in the loss ("slingshots") [21], random walks among optimal solutions [15], and the simplicity of the generalising solution [16, Appendix E]. In this paper we take a different approach, leveraging the simplest possible models that still display grokking: linear estimators.
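To make the setting concrete, below is a minimal, hedged sketch (not the authors' exact setup) of a linear teacher-student experiment in which one can monitor the gap between training and generalization loss during gradient descent. The teacher vector, sample sizes, learning rate, and number of steps are arbitrary illustrative choices; whether the resulting curves exhibit a grokking-like delay between fitting the training set and generalizing depends on these parameters.

```python
# Illustrative sketch: a linear "student" trained by full-batch gradient
# descent to match a linear "teacher" on a small Gaussian training set,
# while tracking both the training loss and the generalization (test) loss.
# All hyperparameters here are assumptions chosen for illustration only.
import numpy as np

rng = np.random.default_rng(0)

d, n_train, n_test = 100, 40, 10_000      # input dimension and sample sizes
teacher = rng.normal(size=d) / np.sqrt(d)  # ground-truth linear map

X_train = rng.normal(size=(n_train, d))
y_train = X_train @ teacher
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ teacher

w = np.zeros(d)          # student weights, initialized at zero
lr, steps = 1e-2, 20_000

for t in range(steps):
    # gradient of the mean-squared training loss
    grad = X_train.T @ (X_train @ w - y_train) / n_train
    w -= lr * grad
    if t % 2_000 == 0:
        train_loss = np.mean((X_train @ w - y_train) ** 2)
        test_loss = np.mean((X_test @ w - y_test) ** 2)
        print(f"step {t:6d}  train {train_loss:.2e}  test {test_loss:.2e}")
```

Logging both losses over time makes the train/generalization discrepancy discussed above directly observable in this simple linear setting.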
