ScaleKD: Strong Vision Transformers Could Be Excellent Teachers

Neural Information Processing Systems

In this paper, we question whether well pre-trained vision transformer (ViT) models could be used as teachers that exhibit scalable properties to advance cross-architecture knowledge distillation research, in the context of adopting mainstream large-scale visual recognition datasets for evaluation. To make this possible, our analysis underlines the importance of seeking effective strategies to align (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three closely coupled components, namely *cross attention projector*, *dual-view feature mimicking* and *teacher parameter perception*, tailored to address the alignment problems stated above, we present a simple and effective knowledge distillation method, called *ScaleKD*. Our method can train student backbones that span a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets, achieving state-of-the-art knowledge distillation performance. Intriguingly, when scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties, bringing increasingly larger gains to student models.
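To make the feature-alignment idea concrete, below is a minimal sketch (not the paper's exact design) of a cross-attention-style projector for cross-architecture feature mimicking: a bank of learned queries, one per teacher token, attends over the student's feature tokens to produce teacher-shaped features, which are then matched to the frozen teacher's features. The class and function names, the query-bank formulation, and the plain MSE mimicking loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionProjector(nn.Module):
    """Hypothetical cross-attention projector for cross-architecture KD.

    Learned queries (one per teacher token) attend over the student's
    feature tokens, bridging differences in token count and channel width
    between, e.g., a CNN student and a ViT teacher.
    """

    def __init__(self, student_dim, teacher_dim, num_teacher_tokens, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_teacher_tokens, teacher_dim))
        self.kv_proj = nn.Linear(student_dim, teacher_dim)
        self.attn = nn.MultiheadAttention(teacher_dim, num_heads, batch_first=True)

    def forward(self, student_tokens):
        # student_tokens: (B, N_s, student_dim)
        batch = student_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)   # (B, N_t, teacher_dim)
        kv = self.kv_proj(student_tokens)                      # (B, N_s, teacher_dim)
        projected, _ = self.attn(q, kv, kv)                    # (B, N_t, teacher_dim)
        return projected


def feature_mimicking_loss(projector, student_tokens, teacher_tokens):
    """MSE between projected student features and frozen teacher features."""
    return F.mse_loss(projector(student_tokens), teacher_tokens.detach())
```

In practice this distillation term would be added to the usual task loss (e.g., cross-entropy on image labels), with the teacher kept frozen during student training.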


Webinar - Statistical hypothesis testing with Python

#artificialintelligence

In this webinar, statistical hypothesis testing will be introduced both in theory and in practice using the Python programming language. The webinar will be given remotely and streamed through the LiveWebinar platform, which works in any up-to-date internet browser, so no installation is required. The duration is about 60 minutes. The speaker will present slides for the theoretical part of the content and will write code live in Google Colaboratory for the practical part.
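As a taste of the practical part, here is a short, self-contained example of the kind of hypothesis test that can be run in Python with SciPy: a two-sample t-test on synthetic control and treatment measurements (the data and group names are made up for illustration).

```python
import numpy as np
from scipy import stats

# Two hypothetical samples, e.g. measurements from a control and a treatment group.
rng = np.random.default_rng(seed=0)
control = rng.normal(loc=10.0, scale=2.0, size=50)
treatment = rng.normal(loc=11.0, scale=2.0, size=50)

# Two-sample t-test: the null hypothesis is that both groups share the same mean.
t_stat, p_value = stats.ttest_ind(control, treatment)

alpha = 0.05  # conventional significance level
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the group means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")
```

Running a snippet like this in Google Colaboratory requires no local setup, since NumPy and SciPy are preinstalled there.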