Label-efficient Training of Small Task-specific Models by Leveraging Vision Foundation Models

Raviteja Vemulapalli, Hadi Pouransari, Fartash Faghri, Sachin Mehta, Mehrdad Farajtabar, Mohammad Rastegari, Oncel Tuzel

arXiv.org Artificial Intelligence 

Large Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high memory and compute requirements, these models cannot be deployed in resource-constrained settings. This raises an important question: how can we utilize the knowledge from a large VFM to train a small task-specific model for a new target task with limited labeled training data? In this work, we answer this question by proposing a simple and highly effective task-oriented knowledge transfer approach that leverages pretrained VFMs for training small task-specific models. Our experimental results on four target tasks under limited labeled data settings show that the proposed knowledge transfer approach outperforms task-agnostic VFM distillation, web-scale CLIP pretraining, and supervised ImageNet pretraining by 1-10.5%, 2-22%, and 2-14%, respectively. We also show that the dataset used for transferring knowledge has a significant effect on the final target task performance, and we propose an image retrieval-based approach for curating effective transfer sets.

Currently, the computer vision community is witnessing the emergence of various vision and multimodal foundation models pretrained on massive datasets (Radford et al., 2021; Yuan et al., 2021; Alayrac et al., 2022; Kirillov et al., 2023; Oquab et al., 2023; Li et al., 2023b; Wang et al., 2023b). These models have been shown to work well for many downstream computer vision tasks, especially when task-specific labeled data is limited (Radford et al., 2021). While a single large foundation model could serve many applications, it cannot be directly used in resource-constrained settings due to its high memory and compute requirements. Moreover, many real-world applications such as autonomous driving, medical image diagnostics, and industrial automation focus on specific tasks and need small task-specific models rather than a large foundation model.
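To make the two ingredients named in the abstract concrete, the sketch below illustrates one plausible reading of them in PyTorch: a transfer set is curated by retrieving unlabeled images whose embeddings are closest to the target-task images, and a small student is then trained to match the predictions of a VFM teacher that has already been adapted to the target task. This is a minimal sketch under our own assumptions, not the authors' released implementation; names such as `curate_transfer_set`, `distill_on_transfer_set`, `vfm_teacher`, `small_student`, and `transfer_loader` are illustrative placeholders.

```python
# Minimal sketch (not the authors' code) of the two ideas mentioned in the
# abstract: retrieval-based transfer set curation and task-oriented
# knowledge transfer (distillation) from a task-adapted VFM teacher.
import torch
import torch.nn.functional as F


def curate_transfer_set(target_feats, candidate_feats, k=10000):
    """Illustrative nearest-neighbor retrieval: keep the k candidate images
    whose embeddings (e.g., from a pretrained image encoder) are most similar
    to the labeled target-task images."""
    target_feats = F.normalize(target_feats, dim=-1)
    candidate_feats = F.normalize(candidate_feats, dim=-1)
    sims = candidate_feats @ target_feats.T        # cosine similarities
    scores = sims.max(dim=1).values                # best target match per candidate
    return scores.topk(k).indices                  # indices into the candidate pool


def distill_on_transfer_set(vfm_teacher, small_student, transfer_loader,
                            epochs=10, lr=1e-3, temperature=2.0, device="cuda"):
    """Train the small student to match the task-adapted teacher's softened
    predictions on the (unlabeled) transfer set via KL divergence."""
    vfm_teacher.eval().to(device)
    small_student.train().to(device)
    optimizer = torch.optim.AdamW(small_student.parameters(), lr=lr)

    for _ in range(epochs):
        for images in transfer_loader:             # unlabeled images only
            images = images.to(device)
            with torch.no_grad():
                teacher_logits = vfm_teacher(images)
            student_logits = small_student(images)
            loss = F.kl_div(
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return small_student
```

In this reading, the student distilled on the curated transfer set would subsequently be fine-tuned on the limited labeled target data; the exact training recipe and loss choices here are assumptions for illustration, not details taken from the excerpt above.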