Intelligent Learning Rate Distribution to reduce Catastrophic Forgetting in Transformers

Kenneweg, Philip, Schulz, Alexander, Schröder, Sarah, Hammer, Barbara

Mar-27-2024, arXiv.org Artificial Intelligence

Pretraining language models on large text corpora is common practice in natural language processing. These models are then fine-tuned to achieve the best results on a variety of tasks. In this paper, we investigate the problem of catastrophic forgetting in transformer neural networks and, in this context, question the common practice of fine-tuning the entire network with a single flat learning rate. We run a hyperparameter optimization to find learning rate distributions that outperform a flat learning rate. We then combine the distributions found in this way and show that the combined distribution generalizes, yielding better performance with respect to catastrophic forgetting.
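To make the contrast between a flat learning rate and a learning rate distribution concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a distribution means one learning rate per depth bucket (embeddings, each encoder layer, task head), that parameter names follow a BERT-style pattern containing `layer.<n>.`, and the helper names `depth_bucket` and `build_param_groups` as well as all numeric values are hypothetical and chosen only for illustration.

```python
import re


def depth_bucket(param_name, num_layers):
    """Map a parameter name to a depth bucket.

    0 = embeddings, 1..num_layers = encoder layers, num_layers + 1 = task head.
    Assumes BERT-style names such as 'bert.encoder.layer.3.output.dense.weight'.
    """
    match = re.search(r"layer\.(\d+)\.", param_name)
    if match:
        return int(match.group(1)) + 1
    if "embeddings" in param_name:
        return 0
    return num_layers + 1


def build_param_groups(named_params, lrs):
    """Group parameters by depth bucket and attach that bucket's learning rate.

    `lrs` holds one rate per bucket: embeddings, each encoder layer, head.
    """
    num_layers = len(lrs) - 2
    buckets = {}
    for name, param in named_params:
        buckets.setdefault(depth_bucket(name, num_layers), []).append(param)
    return [{"params": params, "lr": lrs[idx]} for idx, params in sorted(buckets.items())]


# Toy stand-in for a model's named parameters (strings instead of tensors).
named_params = [
    ("bert.embeddings.word_embeddings.weight", "emb"),
    ("bert.encoder.layer.0.attention.self.query.weight", "q0"),
    ("bert.encoder.layer.0.output.dense.weight", "o0"),
    ("bert.encoder.layer.1.attention.self.query.weight", "q1"),
    ("classifier.weight", "head"),
]

# Flat baseline: every bucket shares one learning rate.
flat = build_param_groups(named_params, [2e-5] * 4)

# Illustrative non-flat distribution (values are made up, not from the paper).
shaped = build_param_groups(named_params, [1e-6, 5e-6, 2e-5, 1e-4])

for group in shaped:
    print(group["lr"], group["params"])
```

With real tensors in place of the placeholder strings, a list of dictionaries in this shape is what PyTorch optimizers accept as per-parameter options, so the same grouping could be handed to an optimizer directly. Which distribution actually reduces forgetting is an empirical question that the paper addresses through hyperparameter optimization; the sketch shows only the mechanism.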

Keywords: dataset, learning rate, rate distribution

Country:
- Europe > Germany (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.70)