Dissecting Query-Key Interaction in Vision Transformers

Neural Information Processing Systems 

Self-attention in vision transformers is often interpreted as performing perceptual grouping: tokens attend to other tokens with similar embeddings, which may correspond to semantically similar features of an object.
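The similarity-based attention described above can be sketched as standard scaled dot-product attention. This is a generic illustration, not the paper's analysis method; the identity query/key projections are an assumption chosen so that attention reduces to raw embedding similarity.

```python
import numpy as np

def attention_weights(X, W_q, W_k):
    """Scaled dot-product attention weights for token embeddings X (n, d)."""
    Q = X @ W_q
    K = X @ W_k
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

# Three toy tokens: the first two have similar embeddings, the third differs.
X = np.array([[1.0, 0.1],
              [0.9, 0.2],
              [-1.0, 1.0]])

# Identity projections (illustrative assumption): attention then reflects
# plain dot-product similarity between token embeddings.
I = np.eye(2)
A = attention_weights(X, I, I)

# Token 0 attends more strongly to the similar token 1 than to token 2,
# consistent with the perceptual-grouping interpretation.
assert A[0, 1] > A[0, 2]
```

Under learned query and key projections the picture is richer, since the projections can reshape which directions of the embedding space count as "similar"; that query-key interaction is what the paper dissects.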