arcco
Appendix
Our results heavily rely on the specific nature of the periodic activation function, so a natural question is to which extent our results can be extended beyond the single periodic neuron class. For lower bounds, a challenging but very interesting generalization would be to establish the cryptographic-hardness of learning certain family of GLMs whose activation function does not need to be periodic. A potentially easier route forward on this direction, would be to consider the Hermite decomposition of the activation function, similar to [A3], and establish lower bounds on the performance of low-degree methods [A23], of SGD [A3], or of local search methods methods [A15], for activation functions whose low-degree Hermite coefficients are exponentially small. For upper bounds, we believe that our proposed LLL-based algorithm may be extended beyond learning even periodic activation functions, such as the cosine activation, by appropriately post-processing the measurements, but leave this for future work. Furthermore, it would be interesting to better understand (empirically or analytically) the noise tolerance of our LLL-based algorithm for "low-frequency" activation functions, such as the absolute value underlying the phase retrieval problem which has "zero" frequency.
Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
Pasand, Ali Saheb, Dohmatob, Elvis
Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close to perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent, along different principal (i.e singular directions) of the gradients. We then propose a simple modification that normalizes the gradients so that dynamics along all the principal directions evolves at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and sparse parity problem which this stagnation has been widely observed and intensively studied, that our proposed method eliminates the plateaus.