Vision Transformers provably learn spatial structure

Jan-19-2025, 07:18:04 GMT–Neural Information Processing Systems

Vision Transformers (ViTs) have recently achieved performance comparable or superior to that of convolutional neural networks (CNNs) in computer vision. This empirical breakthrough is all the more remarkable because ViTs discard spatial information by mixing patch embeddings and positional encodings, and do not embed any visual inductive bias (e.g., spatial locality). Yet recent work has shown that, while minimizing their training loss, ViTs specifically learn spatially localized patterns. This raises a central question: how do ViTs learn these patterns solely by minimizing their training loss with gradient-based methods from random initialization? We propose a structured classification dataset and a simplified ViT model to provide preliminary theoretical justification of this phenomenon.

attention mechanism, spatial structure, vision transformer provably

Technology:
- Information Technology > Artificial Intelligence
  - Vision (0.98)
  - Machine Learning > Neural Networks (0.62)