What Happens During Finetuning of Vision Transformers: An Invariance Based Investigation
Gabriele Merlin, Vedant Nanda, Ruchit Rawal, Mariya Toneva
The pretrain-finetune paradigm usually improves downstream performance over training a model from scratch on the same task and has become commonplace across many areas of machine learning. While pretraining is empirically observed to be beneficial for a range of tasks, the reasons for this effect are not yet well understood. In this work, we examine the relationship between pretrained vision transformers and their finetuned versions on several benchmark datasets and tasks. We present new metrics that specifically investigate the degree to which invariances learned by a pretrained model are retained or forgotten during finetuning. Using these metrics, we present a suite of empirical findings, including that pretraining induces transferable invariances in shallow layers and that invariances from deeper pretrained layers are compressed towards shallower layers during finetuning. Together, these findings contribute to understanding some of the reasons for the success of pretrained models and the changes that a pretrained model undergoes when finetuned on a downstream task.

In recent years, much progress in deep learning has been driven by the reuse of models that were pretrained on large amounts of data, usually by finetuning their parameters on a smaller amount of data from a target downstream task. This pretrain-finetune paradigm usually improves downstream performance over training a model from scratch on the same task, and has become commonplace across many areas of machine learning, including natural language processing (Howard & Ruder, 2018) and computer vision (Girshick et al., 2014). While pretraining is empirically observed to be beneficial for a range of tasks, the reasons for this effect are not yet well understood. Previous work has empirically examined various conditions for pretraining and found that, for a given budget of pretraining images, training with fewer classes but more images per class performs better (Huh et al., 2016). Pretraining has also been posited to accelerate convergence during finetuning (Kornblith et al., 2019b), suggesting that during pretraining models learn transferable representations, particularly when the finetuning task domain is similar to the pretraining domain.
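The paper's exact metrics are not reproduced here, but as a rough illustration of the underlying idea, one could score how invariant a given layer is to a transformation by comparing its representations of an image batch and an augmented copy of that batch, and then contrast these per-layer scores between the pretrained and the finetuned model. The PyTorch sketch below is an assumption-laden illustration, not the authors' method; the helper names (`layer_invariance`, `invariance_profile`, `augment`) and the choice of cosine similarity are hypothetical.

```python
# Hypothetical sketch: per-layer invariance of a model to a transformation,
# measured as cosine similarity between features of original and augmented
# inputs. Not the paper's metric; for illustration only.
import torch
import torch.nn.functional as F


def layer_invariance(feats_orig: torch.Tensor, feats_aug: torch.Tensor) -> float:
    """Mean cosine similarity between representations of original and
    transformed inputs; values near 1 suggest the layer is invariant
    to the transformation."""
    return F.cosine_similarity(
        feats_orig.flatten(1), feats_aug.flatten(1), dim=1
    ).mean().item()


@torch.no_grad()
def invariance_profile(model, layers, images, augment):
    """Collect per-layer invariance scores via forward hooks.

    `layers` is a list of (name, module) pairs to probe, e.g. the transformer
    blocks of a ViT; `augment` is any callable applying the transformation of
    interest (crop, color jitter, ...) to a batch of images.
    """
    captured = {}
    hooks = [
        module.register_forward_hook(
            # Some blocks return tuples; keep only the main tensor output.
            lambda _m, _inp, out, name=name: captured.__setitem__(
                name, (out[0] if isinstance(out, tuple) else out).detach()
            )
        )
        for name, module in layers
    ]
    model.eval()
    model(images)
    feats_orig = dict(captured)
    captured.clear()
    model(augment(images))
    feats_aug = dict(captured)
    for h in hooks:
        h.remove()
    return {name: layer_invariance(feats_orig[name], feats_aug[name]) for name, _ in layers}
```

Under this sketch, comparing the per-layer profiles of the pretrained and finetuned models, for transformations the pretrained model is invariant to, would indicate where along the network's depth invariances are retained, forgotten, or shifted after finetuning.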
arXiv.org Artificial Intelligence
Jul-12-2023