On the Theory of Continual Learning with Gradient Descent for Neural Networks

Taheri, Hossein, Ghosh, Avishek, Mazumdar, Arya

arXiv.org Machine Learning 

Gradient-based methods are the primary approach for training ne ural networks. In recent years, research in learning theory has shown that neural networks can efficiently lea rn various data classes using empirical risk minimization (ERM) methods. In many real-world settings, data a rrive sequentially in a non-stationary manner, requiring the learner to maintain performance on past tas ks while acquiring new capabilities. In such cases, a learning model must be continually learnable, meaning it should retain previously acquired knowledge when trained on new tasks. On the other hand, various le arning systems, including deep learning architectures, can be prone to catastrophic forgetting, that is, updating a model on new data causes a dramatic drop in performance on previously learned tasks [ McCloskey and Cohen, 1989, Goodfellow et al., 2013 ]. The goal of continual (lifelong) learning is to develop models and methods that, even without retraining on old data, experience minimal forgetting when incorporating new inform ation. Despite deep learning's ubiquity, characterizing the power and limitat ions of neural networks is still an ongoing research direction. While several recent works aim to unde rstand the power of gradient descent (GD) for training neural networks with stylized data distributions, these works are still limited to single-task scenarios (for some examples see [ Du et al., 2019, Bartlett et al., 2021, Abbe et al., 2022 ]).