Appendix

Additional details and results from the different sections are included below.

For vision transformers, we train linear probes on the representations of individual tokens, or on the representation averaged over all tokens, at the output of different transformer layers (each layer meaning a full transformer block, including self-attention and MLP). Because the probe is a linear model, its solution can be recovered efficiently in closed form.

The top panel shows the CKA heatmap for ViT-B/32, where we can also observe strong similarity between lower and higher layers and the grid-like, uniform representation structure.

In Figures C.1, C.2, and C.3, we provide full plots of the effective receptive fields of all layers of ViT-B/32, ResNet-50, and ViT-L/16, taken after the residual connections, as in Figure 6 in the main text.
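To make the probing setup concrete, the following is a minimal sketch of a closed-form linear probe fit by ridge regression on frozen layer representations. The function names, the regularization constant, and the one-hot regression formulation are illustrative assumptions, not taken from the paper's code; the only property it is meant to convey is that the probe weights are obtained in closed form rather than by iterative training.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, reg=1e-3):
    """Fit a linear probe by ridge regression in closed form.

    features: (num_examples, dim) layer representations, e.g. one token's
        output or the mean over all tokens of a transformer block.
    labels:   (num_examples,) integer class labels.
    Returns a (dim + 1, num_classes) weight matrix (bias included).
    """
    # Append a constant feature so the probe also learns a bias term.
    X = np.concatenate([features, np.ones((features.shape[0], 1))], axis=1)
    # One-hot targets turn the multi-class probe into a multi-output regression.
    Y = np.eye(num_classes)[labels]
    # Closed-form ridge solution: W = (X^T X + reg * I)^{-1} X^T Y.
    gram = X.T @ X + reg * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def probe_accuracy(W, features, labels):
    """Evaluate the probe: predict the class with the largest regression score."""
    X = np.concatenate([features, np.ones((features.shape[0], 1))], axis=1)
    preds = np.argmax(X @ W, axis=1)
    return float(np.mean(preds == labels))
```

The same routine can be applied layer by layer to either the per-token or the token-averaged representations, yielding one probe accuracy per layer.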
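For the CKA heatmaps, the sketch below shows standard linear CKA computed on full representation matrices, which is an assumption for illustration; in practice one may instead use a minibatch estimator, and the helper names are hypothetical. Each entry of the heatmap compares the representations of one pair of layers over the same set of examples.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X: (num_examples, dim_x), Y: (num_examples, dim_y), same examples in the
    same order. Returns a scalar in [0, 1].
    """
    # Center each feature dimension across examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Linear-kernel HSIC terms expressed via feature cross-covariances.
    cross = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

def cka_heatmap(layer_reps):
    """Pairwise CKA matrix over a list of per-layer representation arrays."""
    n = len(layer_reps)
    heatmap = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            heatmap[i, j] = linear_cka(layer_reps[i], layer_reps[j])
    return heatmap
```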
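For the effective receptive field plots, a gradient-based sketch is given below: it backpropagates from a single unit at a chosen layer (taken after the residual connection) to the input pixels and aggregates the gradient magnitude. The `model_up_to_layer` callable and the handling of token-shaped versus spatial activations are assumptions made for illustration, not the paper's implementation.

```python
import torch

def effective_receptive_field(model_up_to_layer, images, position):
    """Gradient-based effective receptive field for one unit.

    model_up_to_layer: callable mapping an image batch (B, C, H, W) to the
        activations of the layer of interest, either (B, tokens, dim) for a
        transformer block or (B, dim, H', W') for a convolutional stage.
    images: (B, C, H, W) input batch.
    position: token index (int) or (row, col) tuple of the centre unit.
    Returns an (H, W) map of mean absolute input gradients.
    """
    images = images.clone().requires_grad_(True)
    activations = model_up_to_layer(images)
    # Sum the chosen unit's activations over batch and feature dimensions,
    # then backpropagate to the input pixels.
    if activations.dim() == 3:          # transformer-style (B, tokens, dim)
        target = activations[:, position, :].sum()
    else:                               # conv-style (B, dim, H', W')
        target = activations[:, :, position[0], position[1]].sum()
    target.backward()
    # Aggregate gradient magnitude over the batch and colour channels.
    return images.grad.abs().mean(dim=(0, 1))
```

Repeating this for every layer and normalizing each map produces the per-layer receptive field grids shown in Figures C.1, C.2, and C.3.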