On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Yuandong Tian (FAIR), Beidi Chen (Carnegie Mellon University), Deepak Pathak (Carnegie Mellon University)

Neural Information Processing Systems

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations.