Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation

Neural Information Processing Systems 

In the past few years, transformers have achieved promising performance on various computer vision tasks. Unfortunately, the immense inference overhead of most existing vision transformers prevents them from being deployed on edge devices such as cellphones and smart watches. Knowledge distillation is a widely used paradigm for compressing cumbersome architectures into compact students by transferring information.
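As background, the classical logit-based distillation objective (the generic soft-target loss of Hinton et al., not this paper's fine-grained manifold variant) can be sketched as follows; the function names and the temperature value are illustrative assumptions, not from the paper:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; subtract the max for numerical stability.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=4.0):
    # Standard soft-target distillation loss: soften both distributions
    # with temperature T, then take KL(teacher || student), scaled by T^2
    # so gradients keep a comparable magnitude across temperatures.
    p = softmax(teacher_logits, T)  # teacher's soft targets
    q = softmax(student_logits, T)  # student's predictions
    return T * T * float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

When the student's logits match the teacher's exactly, the loss is zero; it grows as the two distributions diverge. In practice this term is combined with the ordinary cross-entropy on ground-truth labels.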