Accelerating Training with Neuron Interaction and Nowcasting Networks

Knyazev, Boris, Moudgil, Abhinav, Lajoie, Guillaume, Belilovsky, Eugene, Lacoste-Julien, Simon

arXiv.org Machine Learning 

Neural network training can be accelerated when a learnable update rule is used in lieu of classic adaptive optimizers (e.g. Adam). However, learnable update rules can be costly and unstable to train and use. A simpler, recently proposed approach to accelerating training is to use Adam for most of the optimization steps and, periodically, only every few steps, nowcast (predict future) parameters. We show that in some networks, such as Transformers, neuron connectivity is non-trivial.

Recently, Jang et al. (2023) and Sinha et al. (2017) showed that parameters θ follow a predictable trend.

Figure 1 (caption excerpt): More popular learnable approaches to speed up optimization, such as "learning to optimize" (L2O), are recurrently applied at every step t (Andrychowicz et al., 2016; Metz et al., 2022).

This structure has been shown to be critical for many parameter representation tasks, such as property prediction (Navon et al., 2023; Zhou et al., 2023; Kofinas et al., 2024).

We use Adam throughout the paper, but our discussion and methods are in principle applicable to any optimizer that produces a trajectory of parameters, including SGD with or without momentum, AdamW, Adagrad, etc.
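The periodic-nowcasting idea can be made concrete with a minimal sketch: run an ordinary optimizer for most steps and, every few steps, jump ahead by predicting future parameters from the recent trajectory. Plain linear extrapolation stands in for a learned nowcaster here, and the names `grad_fn`, `ctx`, and `horizon` are illustrative assumptions, not the paper's API.

```python
import numpy as np

def nowcast(history, horizon):
    """Linearly extrapolate each parameter `horizon` steps ahead,
    exploiting the predictable trend that parameters follow."""
    t = np.arange(len(history), dtype=float)
    P = np.stack(history)                      # (c, d): c checkpoints, d params
    slope = np.polyfit(t, P, 1)[0]             # per-parameter linear trend
    return P[-1] + horizon * slope

def train(theta, grad_fn, lr=0.01, steps=200, every=20, ctx=5, horizon=10):
    """Plain gradient descent for most steps; every `every` steps,
    replace theta with a nowcast of its future value."""
    history = []
    for step in range(steps):
        theta = theta - lr * grad_fn(theta)    # ordinary optimizer step
        history.append(theta.copy())
        history = history[-ctx:]               # keep a short context window
        if (step + 1) % every == 0 and len(history) == ctx:
            theta = nowcast(history, horizon)  # periodic parameter jump
            history = []                       # restart trajectory after jump
    return theta
```

On a toy quadratic loss the jumps simply extrapolate the smooth decay of θ; a learned nowcaster would replace the linear fit with a model trained on many optimization trajectories.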