What Can Grokking Teach Us About Learning Under Nonstationarity?
Clare Lyle, Ghada Sokar, Razvan Pascanu, Andras Gyorgy
–arXiv.org Artificial Intelligence
In continual learning problems, it is often necessary to overwrite components of a neural network's learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. While the feature-learning dynamics of nonstationary learning problems are not well studied, the emergence of feature learning is known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by promoting feature learning are promising candidates for addressing primacy bias in non-stationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the effective learning rate, i.e. the ratio of update norm to parameter norm. We show that this approach both facilitates feature learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.

Non-stationarity is ubiquitous in real-world applications of AI systems: datasets may grow over time, correlations may appear and then disappear as trends evolve, and AI systems themselves may take an active role in generating their own training data. In this paper, we propose a framework for understanding and mitigating this degradation in generalization performance, one which connects three previously disparate phenomena: primacy bias, grokking, and feature-learning dynamics.
- Primacy bias: a neural network initially trained on one task is then trained on a different data distribution and/or objective, and achieves worse performance than a randomly initialized network on the new task (Achille et al., 2017; Ash & Adams, 2020; Nikishin et al., 2022).
- Grokking: a model suddenly closes the generalization gap as a result of (possibly prolonged) further training after it has initially achieved perfect training accuracy (memorization) with poor test-time performance (Power et al., 2022).
- Feature learning: a network's ability to make nontrivial changes to its learned representation (a.k.a. its features).
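The effective learning rate described above can be made concrete with a short sketch. The function names and the rescaling intervention below are illustrative assumptions, not the authors' exact method: `effective_lr` measures the ratio of update norm to parameter norm, and `rescale_update` scales an optimizer step so that this ratio matches a chosen target, which is one simple way to "increase the effective learning rate" on demand.

```python
import numpy as np

def effective_lr(params, update):
    """Effective learning rate: ratio of the update norm to the parameter norm.
    Small epsilon guards against division by zero for tiny parameter norms."""
    return np.linalg.norm(update) / (np.linalg.norm(params) + 1e-12)

def rescale_update(params, update, target_eff_lr):
    """Rescale an optimizer update so the step achieves a target effective
    learning rate (hypothetical intervention sketch)."""
    current = effective_lr(params, update)
    if current <= 0.0:
        return update  # zero update: nothing to rescale
    return update * (target_eff_lr / current)
```

In practice such a rescaling would be applied per layer (or per parameter group) inside the training loop, after the optimizer computes its raw update and before the parameters are modified.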
Jul-29-2025