Wide Neural Networks Forget Less Catastrophically

Seyed Iman Mirzadeh, Arslan Chaudhry, Huiyi Hu, Razvan Pascanu, Dilan Gorur, Mehrdad Farajtabar

arXiv.org Artificial Intelligence 

Machine learning increasingly relies on training large models on large static datasets to achieve impressive results (Kaplan et al., 2020; Lazaridou et al., 2021; Hombaiah et al., 2021). The real world, however, changes over time, and new information arrives at an unprecedented rate (Lazaridou et al., 2021; Hombaiah et al., 2021). In such settings, a learning agent is exposed to a continuous stream of data with a potentially changing distribution, and it must absorb new information efficiently while being unable to revisit previous data as freely as desired due to time, sample, compute, privacy, or environmental-complexity constraints (Parisi et al., 2018). To address these challenges, fields such as continual learning (CL) (Ring et al., 1994) and lifelong learning (Thrun, 1995) have recently gained considerable attention. One of the key challenges for continual learning models is the abrupt erasure of previous knowledge, referred to as catastrophic forgetting (CF) (McCloskey and Cohen, 1989). Alleviating catastrophic forgetting has attracted much attention, and many solutions have been proposed to partly overcome the issue (e.g., Toneva et al., 2018; Nguyen et al., 2019; Hsu et al., 2018; Li et al., 2019; Wallingford et al., 2020). These solutions vary in complexity from simple replay-based methods to more involved regularization- or network-expansion-based methods. However, there is still little fundamental understanding of the intrinsic properties of neural networks that affect continual learning performance through catastrophic forgetting or forward/backward transfer (Mirzadeh et al., 2020).

Work done during an internship at DeepMind.
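The phenomenon of catastrophic forgetting can be seen even in the simplest setting. The following sketch (not from the paper; the tasks and model are illustrative assumptions) trains one linear model sequentially on two tasks with conflicting targets using plain gradient descent, and measures how task-A error changes after training on task B:

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model y_hat = X @ w."""
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.1, steps=200):
    """Plain full-batch gradient descent on MSE (no continual-learning remedy)."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))

# Two illustrative tasks with conflicting input-output mappings:
# task A targets y = 2x, task B targets y = -2x.
y_a, y_b = 2 * X[:, 0], -2 * X[:, 0]

w = np.zeros(1)
w = train(w, X, y_a)
loss_a_before = mse(w, X, y_a)  # near zero: task A is learned

w = train(w, X, y_b)            # continue training on task B only
loss_a_after = mse(w, X, y_a)   # task-A error grows sharply: forgetting

print(loss_a_before, loss_a_after)
```

Because nothing constrains the weights to stay near the task-A solution, optimizing task B overwrites it entirely; replay, regularization, and expansion methods each counteract this overwriting in a different way.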