Vision Transformers provably learn spatial structure

Aug-19-2025, 20:37:22 GMT, Neural Information Processing Systems

We propose a spatially structured dataset and a simplified Vision Transformer (ViT) model in which the attention matrix depends solely on the positional encodings. We call this the positional attention mechanism.
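
To make the idea concrete, here is a minimal sketch of attention whose weights are computed from positional encodings alone. It is an illustration under stated assumptions, not the paper's exact model: the dot-product score, the row-wise softmax, the toy encodings, and the function names `positional_attention` and `apply_attention` are all hypothetical choices made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def positional_attention(pos_enc):
    """Attention matrix built only from positional encodings.

    Assumption: score(i, j) is the dot product of the encodings of
    positions i and j. The patch contents never enter this function.
    """
    scores = [
        [sum(a * b for a, b in zip(p_i, p_j)) for p_j in pos_enc]
        for p_i in pos_enc
    ]
    return [softmax(row) for row in scores]


def apply_attention(attn, patches):
    """Mix patch embeddings: out[i] = sum_j attn[i][j] * patches[j]."""
    dim = len(patches[0])
    return [
        [sum(w * patch[k] for w, patch in zip(row, patches)) for k in range(dim)]
        for row in attn
    ]


# Toy (hypothetical) encodings: positions 0 and 1 share an encoding,
# position 2 has a different one.
pos_enc = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
attn = positional_attention(pos_enc)

# Two different inputs are mixed with the same attention matrix,
# because the weights depend on positions only.
patches_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
patches_b = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
out_a = apply_attention(attn, patches_a)
out_b = apply_attention(attn, patches_b)
```

In this sketch, position 0 attends equally to positions 0 and 1 (they share an encoding) and less to position 2, regardless of what the patches contain. In a standard ViT the scores would also depend on the patch embeddings through query and key projections.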



Country:
- North America > United States
  - Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > Japan
  - Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)

Genre:
- Research Report (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
