What Happens During Finetuning of Vision Transformers: An Invariance Based Investigation
Gabriele Merlin, Vedant Nanda, Ruchit Rawal, Mariya Toneva
The pretrain-finetune paradigm usually improves downstream performance over training a model from scratch on the same task and has become commonplace across many areas of machine learning. While pretraining is empirically observed to be beneficial for a range of tasks, the reasons for this effect are not yet well understood. In this work, we examine the relationship between pretrained vision transformers and their finetuned versions on several benchmark datasets and tasks. We present new metrics that specifically investigate the degree to which invariances learned by a pretrained model are retained or forgotten during finetuning. Using these metrics, we present a suite of empirical findings, including that pretraining induces transferable invariances in shallow layers and that invariances from deeper pretrained layers are compressed towards shallower layers during finetuning. Together, these findings contribute to understanding some of the reasons for the success of pretrained models and the changes that a pretrained model undergoes when finetuned on a downstream task.

In recent years, much progress in deep learning has been driven by the reuse of models that were pretrained on large amounts of data, usually by finetuning their parameters on a smaller amount of data from a target downstream task. This pretrain-finetune paradigm usually improves downstream performance over training a model from scratch on the same task, and has become commonplace across many areas of machine learning, including natural language processing (Howard & Ruder, 2018) and computer vision (Girshick et al., 2014). While pretraining is empirically observed to be beneficial for a range of tasks, the reasons for this effect are not yet well understood. Previous work has empirically examined various conditions for pretraining and found that, for a given budget of pretraining images, training with fewer classes but more images per class performs better (Huh et al., 2016). Pretraining has also been posited to accelerate convergence during finetuning (Kornblith et al., 2019b), suggesting that during pretraining models learn transferable representations, particularly when the finetuning task domain is similar to the pretraining domain.
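The paper's exact metrics are not reproduced here, but as a rough illustration of the underlying idea, one could score how invariant a given layer is to a transformation by comparing its representations of an image batch and an augmented copy of that batch, and then contrast these per-layer scores between the pretrained and the finetuned model. The PyTorch sketch below is an assumption-laden illustration, not the authors' method; the helper names (`layer_invariance`, `invariance_profile`, `augment`) and the choice of cosine similarity are hypothetical.

```python
# Hypothetical sketch: per-layer invariance of a model to a transformation,
# measured as cosine similarity between features of original and augmented
# inputs. Not the paper's metric; for illustration only.
import torch
import torch.nn.functional as F


def layer_invariance(feats_orig: torch.Tensor, feats_aug: torch.Tensor) -> float:
    """Mean cosine similarity between representations of original and
    transformed inputs; values near 1 suggest the layer is invariant
    to the transformation."""
    return F.cosine_similarity(
        feats_orig.flatten(1), feats_aug.flatten(1), dim=1
    ).mean().item()


@torch.no_grad()
def invariance_profile(model, layers, images, augment):
    """Collect per-layer invariance scores via forward hooks.

    `layers` is a list of (name, module) pairs to probe, e.g. the transformer
    blocks of a ViT; `augment` is any callable applying the transformation of
    interest (crop, color jitter, ...) to a batch of images.
    """
    captured = {}
    hooks = [
        module.register_forward_hook(
            # Some blocks return tuples; keep only the main tensor output.
            lambda _m, _inp, out, name=name: captured.__setitem__(
                name, (out[0] if isinstance(out, tuple) else out).detach()
            )
        )
        for name, module in layers
    ]
    model.eval()
    model(images)
    feats_orig = dict(captured)
    captured.clear()
    model(augment(images))
    feats_aug = dict(captured)
    for h in hooks:
        h.remove()
    return {name: layer_invariance(feats_orig[name], feats_aug[name]) for name, _ in layers}
```

Under this sketch, comparing the per-layer profiles of the pretrained and finetuned models, for transformations the pretrained model is invariant to, would indicate where along the network's depth invariances are retained, forgotten, or shifted after finetuning.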
arXiv.org Artificial Intelligence
Jul-12-2023