Intelligent Learning Rate Distribution to reduce Catastrophic Forgetting in Transformers

Philip Kenneweg, Alexander Schulz, Sarah Schröder, Barbara Hammer

arXiv.org Artificial Intelligence 

Pretraining language models on large text corpora is common practice in natural language processing. These models are then fine-tuned to achieve the best results on a variety of tasks. In this paper, we investigate the problem of catastrophic forgetting in transformer neural networks and question the common practice of fine-tuning with a flat learning rate for the entire network in this context. We perform hyperparameter optimization to find learning rate distributions that outperform a flat learning rate. We then combine the distributions found in this way and show that the combination generalizes, yielding better performance with respect to catastrophic forgetting.
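To make the contrast between a flat learning rate and a learning rate distribution concrete, the following is a minimal sketch, not the authors' implementation. It assumes that parameter names contain a substring of the form "layer.<index>." identifying the transformer layer; the function names, the parameter names, and all learning rate values are hypothetical and are not taken from the paper.

```python
import re
from typing import Any, Dict, Iterable, List, Optional, Tuple

# Assumption: parameter names contain "layer.<index>." for transformer layers.
LAYER_PATTERN = re.compile(r"layer\.(\d+)\.")


def layer_index(param_name: str) -> Optional[int]:
    """Return the layer index encoded in a parameter name, or None if absent."""
    match = LAYER_PATTERN.search(param_name)
    return int(match.group(1)) if match else None


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    layer_lrs: List[float],
    default_lr: float,
) -> List[Dict[str, Any]]:
    """Group parameters by layer and attach one learning rate to each group.

    Parameters that belong to no numbered layer (e.g. embeddings) share
    default_lr. Groups are returned in order of first appearance.
    """
    groups: Dict[Optional[int], Dict[str, Any]] = {}
    for name, param in named_params:
        idx = layer_index(name)
        lr = default_lr if idx is None else layer_lrs[idx]
        group = groups.setdefault(idx, {"params": [], "lr": lr})
        group["params"].append(param)
    return list(groups.values())


# Hypothetical parameter list; strings stand in for real parameter tensors.
named = [
    ("embeddings.weight", "p0"),
    ("encoder.layer.0.attn.weight", "p1"),
    ("encoder.layer.1.attn.weight", "p2"),
    ("encoder.layer.1.ffn.weight", "p3"),
]

# Flat learning rate: every layer receives the same value.
flat_groups = build_param_groups(named, [2e-5, 2e-5], default_lr=2e-5)

# Learning rate distribution: one value per layer (illustrative values only).
dist_groups = build_param_groups(named, [1e-5, 4e-5], default_lr=2e-5)
```

In PyTorch, a list of dictionaries with "params" and "lr" keys is the format that optimizers in torch.optim accept for per-parameter options, so groups built this way could be passed to an optimizer such as torch.optim.AdamW once the placeholder strings are replaced by real parameter tensors. The per-layer values themselves would come from a hyperparameter search like the one the abstract describes.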