Grokking phase transitions in learning local rules with gradient descent
Bojan Žunkovič, Enej Ilievski
arXiv.org Artificial Intelligence
Despite recent progress in understanding the double descent phenomenon [1, 2, 3, 4], we still do not have a complete theory of generalisation in over-parameterised models. Two recent empirical observations, neural collapse [5] and grokking (generalisation beyond over-fitting) [6], can help us understand the training and generalisation properties of over-parameterised models. Neural collapse occurs in the terminal phase of training, i.e. the phase with zero train error. It refers to the collapse of the N-dimensional last-layer features (the input to the last/classification layer) [5] to a (C−1)-dimensional equiangular tight frame (ETF) structure, where C is the number of classes. The feature vectors converge towards the vertices of the ETF such that the features of each class cluster around one vertex.
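A minimal sketch of the simplex ETF geometry referred to above, using the standard construction sqrt(C/(C−1))(I − 11ᵀ/C) (an assumed illustration, not code from the paper): the C columns are unit-norm, span a (C−1)-dimensional subspace, and every pair has cosine similarity −1/(C−1), which is the configuration class-mean features approach under neural collapse.

```python
import numpy as np

C = 4  # number of classes (illustrative choice)

# Columns of M form a simplex equiangular tight frame (ETF) in R^C.
M = np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)

gram = M.T @ M  # Gram matrix of the C frame vectors

# Unit norm: each vertex vector has length 1.
assert np.allclose(np.diag(gram), 1.0)

# Equiangular: every pair of distinct vertices has cosine similarity -1/(C-1).
off_diag = gram[~np.eye(C, dtype=bool)]
assert np.allclose(off_diag, -1.0 / (C - 1))

# The C vertices span only a (C-1)-dimensional subspace.
assert np.linalg.matrix_rank(M) == C - 1
```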
Oct-26-2022