Accessing Vision Foundation Models at ImageNet-level Costs

Zhang, Yitian, Ma, Xu, Bai, Yue, Wang, Huan, Fu, Yun

Jul-14-2024–arXiv.org Artificial Intelligence

Vision foundation models are renowned for their generalization ability, which stems from massive training data. Nevertheless, they demand tremendous training resources, and their training data is often inaccessible (e.g., for CLIP and DINOv2), which poses great challenges to developing derivatives that could advance research in this field. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs yet shows surprisingly strong ability, making the training of foundation models more accessible to the broader research community. Leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 15 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M). Code is available here.
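The abstract names three levels of training objectives (token, patch, and feature) but does not define them. The sketch below is one possible reading, not the authors' implementation. It assumes the "token" term compares a single class-token embedding, the "patch" term compares per-patch embeddings, and the "feature" term compares a mean-pooled summary of the patches. It also assumes student outputs have already been projected to the teacher's dimension. All function names are hypothetical. No class labels appear anywhere, which is consistent with a distillation setup that relies only on teacher outputs.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mean_pool(tokens: Sequence[Vector]) -> List[float]:
    """Average a sequence of token vectors into one vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def distillation_loss(
    student_cls: Vector,
    student_patches: Sequence[Vector],
    teacher_cls: Vector,
    teacher_patches: Sequence[Vector],
) -> float:
    """Sum of three label-free terms (assumed definitions, equal weights)."""
    if len(student_patches) != len(teacher_patches):
        raise ValueError("student and teacher must have the same number of patches")

    # Assumed "token" level: match the class-token embedding.
    token_term = mse(student_cls, teacher_cls)

    # Assumed "patch" level: match each patch embedding, averaged over patches.
    patch_term = sum(
        mse(s, t) for s, t in zip(student_patches, teacher_patches)
    ) / len(teacher_patches)

    # Assumed "feature" level: match a mean-pooled summary of the patches.
    feature_term = mse(mean_pool(student_patches), mean_pool(teacher_patches))

    return token_term + patch_term + feature_term
```

In practice the embeddings would come from a frozen teacher and a trainable student evaluated on the same ImageNet-1K image, and the terms might be weighted differently; the equal weighting here is an assumption made for brevity.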

dataset, foundation model, proteus

Genre:
- Instructional Material (0.54)
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Vision > Image Understanding (0.49)
  - Machine Learning
    - Neural Networks (0.68)
    - Inductive Learning (0.48)
