Orthogonal Gradient Descent for Continual Learning
Farajtabar, Mehrdad, Azizan, Navid, Mott, Alex, Li, Ang
Orthogonal Gradient Descent for Continual LearningMehrdad Farajtabar Navid Azizan 1 Alex Mott Ang Li DeepMind CalTech DeepMind DeepMind Abstract Neural networks are achieving state of the art and sometimes superhuman performance on learning tasks across a variety of domains. Whenever these problems require learning in a continual or sequential manner, however, neural networks suffer from the problem of catastrophic forgetting; they forget how to solve previous tasks after being trained on a new task, despite having the essential capacity to solve both tasks if they were trained on both simultaneously. In this paper, we propose to address this issue from a parameter space perspective and study an approach to restrict the direction of the gradient updates to avoid forgetting previously-learned data. We present the Orthogonal Gradient Descent (OGD) method, which accomplishes this goal by projecting the gradients from new tasks onto a subspace in which the neural network output on previous task does not change and the projected gradient is still in a useful direction for learning the new task. Our approach utilizes the high capacity of a neural network more efficiently and does not require storing the previously learned data that might raise privacy concerns. Experiments on common benchmarks reveal the effectiveness of the proposed OGD method. 1 Introduction One critical component of intelligence is the ability to learn continuously, when new information is constantly available but previously presented information is unavailable to retrieve. Despite their ubiquity in the real world, these problems have posed a longstanding challenge to artificial intelligence (Thrun and Mitchell, 1995; Hassabis et al., 2017).Correspondence to farajtabar@google.com. 1 Work done during an internship at DeepMind. A typical neural network training procedure over a sequence of different tasks usually results in degraded performance on previously trained tasks if the model could not revisit the data of previous tasks. This phenomenon is called catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990; French, 1999). Ideally, an intelligent agent should be able to learn consecutive tasks without degrading its performance on those already learned. With the deep learning renaissance (Krizhevsky et al., 2012; Hinton et al., 2006; Si-monyan and Zisserman, 2014) this problem has been revived (Srivastava et al., 2013; Goodfellow et al., 2013) with many followup studies (Parisi et al., 2019). One probable reason for this phenomenon is that neural networks are usually trained by Stochastic Gradient Descent (SGD)--or its variants--where the op-timizers produce gradients that are oblivious to past knowledge.
Oct-15-2019
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Education (0.68)
- Health & Medicine (0.46)
- Information Technology > Security & Privacy (0.34)
- Technology: