Vision Transformers provably learn spatial structure