On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Neural Information Processing Systems

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations.