Appendix

Neural Information Processing Systems 

For vision transformers, we train linear probes on representations from individual tokens, or on the representation averaged over all tokens, at the output of different transformer layers (each layer meaning a full transformer block including self-attention and MLP). Moreover, ResNets differ from ViTs in that the number of channels changes throughout the model, with fewer channels in the earlier layers. We train a linear probe on each individual token and plot the average accuracy over the test set, in percent. Here we plot the results for each token at a subset of layers in three models: ViT-B/32 trained with a classification token (CLS) or with global average pooling (GAP), as well as a ResNet-50. There are two main observations to be made.
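The per-token probing procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the shapes, the synthetic activations, and the closed-form ridge-regression probe are all assumptions standing in for real ViT layer outputs and whatever probe training the authors used.

```python
import numpy as np

# Hypothetical shapes: N examples, T tokens, D-dim representations, C classes.
N, T, D, C = 200, 8, 16, 4
rng = np.random.default_rng(0)

# Synthetic per-token representations from one transformer layer (stand-in
# for real ViT activations) and random labels.
reps = rng.normal(size=(N, T, D))
labels = rng.integers(0, C, size=N)
onehot = np.eye(C)[labels]

def probe_accuracy(X, y_onehot, y):
    """Fit a ridge-regression linear probe on X and return its accuracy."""
    Xb = np.concatenate([X, np.ones((len(X), 1))], axis=1)  # append bias column
    W = np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(Xb.shape[1]), Xb.T @ y_onehot)
    return float((np.argmax(Xb @ W, axis=1) == y).mean())

# One probe per individual token, plus one probe on the token-averaged
# representation, mirroring the two probing variants described in the text.
per_token_acc = [probe_accuracy(reps[:, t], onehot, labels) for t in range(T)]
avg_acc = probe_accuracy(reps.mean(axis=1), onehot, labels)
```

In practice the probe would be fit on held-out training representations and evaluated on the test set, once per token and per layer.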
