On the Theory of Continual Learning with Gradient Descent for Neural Networks

Taheri, Hossein, Ghosh, Avishek, Mazumdar, Arya

Oct-8-2025–arXiv.org Machine Learning

Gradient-based methods are the primary approach for training neural networks. In recent years, research in learning theory has shown that neural networks can efficiently learn various data classes using empirical risk minimization (ERM) methods. In many real-world settings, data arrive sequentially in a non-stationary manner, requiring the learner to maintain performance on past tasks while acquiring new capabilities. In such cases, a learning model must be continually learnable, meaning it should retain previously acquired knowledge when trained on new tasks. On the other hand, various learning systems, including deep learning architectures, can be prone to catastrophic forgetting, that is, updating a model on new data causes a dramatic drop in performance on previously learned tasks [McCloskey and Cohen, 1989, Goodfellow et al., 2013]. The goal of continual (lifelong) learning is to develop models and methods that, even without retraining on old data, experience minimal forgetting when incorporating new information. Despite deep learning's ubiquity, characterizing the power and limitations of neural networks is still an ongoing research direction. While several recent works aim to understand the power of gradient descent (GD) for training neural networks with stylized data distributions, these works are still limited to single-task scenarios (for some examples see [Du et al., 2019, Bartlett et al., 2021, Abbe et al., 2022]).

continual learning, neural network, training loss

Country:
- North America > United States > California > San Diego County > San Diego (0.04)

Genre:
- Research Report > New Finding (0.67)

Industry:
- Education > Educational Setting (0.34)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Statistical Learning > Gradient Descent (1.00)
  - Neural Networks > Deep Learning (0.68)