Vision Transformers provably learn spatial structure

Neural Information Processing Systems 

We propose a spatially structured dataset and a simplified ViT model. In this model, the attention matrix solely depends on the positional encodings. We call this mechanism the positional attention mechanism.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found