On the Surprising Effectiveness of Attention Transfer for Vision Transformers
Neural Information Processing Systems
Conventional wisdom suggests that pre-training Vision Transformers (ViTs) improves downstream performance by learning useful representations. We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high-quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, in which only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or by distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher further improves accuracy on ImageNet.
May-27-2025, 17:11:21 GMT
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (0.98)
- Vision (0.65)