What Do Self-Supervised Vision Transformers Learn?
Namuk Park, Wonjae Kim, Byeongho Heo, Taekyung Kim, Sangdoo Yun
arXiv.org Artificial Intelligence
We present a comparative study on how and why contrastive learning (CL) and masked image modeling (MIM) differ in their representations and in their performance on downstream tasks. In particular, we demonstrate that self-supervised Vision Transformers (ViTs) have the following properties: (1) CL trains self-attentions to capture longer-range global patterns than MIM, such as the shape of an object, especially in the later layers of the ViT architecture. This CL property helps ViTs linearly separate images in their representation spaces. However, it also makes the self-attentions collapse into homogeneity for all query tokens and heads. Such homogeneity of self-attention reduces the diversity of representations, worsening scalability and dense prediction performance. (2) CL primarily exploits the low-frequency signals of the representations, whereas MIM exploits the high-frequency signals. Since low- and high-frequency information respectively represent shapes and textures, CL is more shape-oriented and MIM more texture-oriented. Upon these analyses, we find that CL and MIM can complement each other and observe that even the simplest harmonization can help leverage the advantages of both methods.

Contrastive Learning (CL) (He et al., 2020; Chen et al., 2020a;b; 2021) has been the most popular self-supervised learning method until recently. It aims to learn the invariant semantics of two random views (Tian et al., 2020a;b) by making global projections of representations similar for positive samples and dissimilar for negative samples. Since CL contrasts globally projected representations against each other, it can be deemed an "image-level" self-supervised learning approach. Deviating from CL, masked image modeling (MIM) (Bao et al., 2022; Xie et al., 2022b; He et al., 2022) has risen as a strong competitor to CL in the era of Vision Transformers (ViTs) (Dosovitskiy et al., 2021) with its impressive performance on downstream tasks. MIM trains ViTs by reconstructing the correct semantics of masked input patches. Unlike CL, it learns the semantics of patch tokens, so it can be deemed a "token-level" self-supervised learning approach. Since MIM outperforms CL in fine-tuning accuracy, it may appear prima facie to be a more effective pre-training method than CL. However, a different trend is observed for linear probing accuracy, with CL outperforming MIM (see Figure 1). For further exposition on CL and MIM, we refer the reader to Appendix B. Which method, then, should we use for the self-supervised learning of ViTs: CL or MIM? Although both methods are widely used, little is known about what they learn.
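To make the "image-level" versus "token-level" distinction concrete, the following is a minimal PyTorch sketch of the two kinds of objectives. It is not the exact loss of any particular CL or MIM method; the tensor shapes and helper names (contrastive_loss, masked_reconstruction_loss) are illustrative assumptions.

import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    # Image-level objective: global projections of two views of the same image
    # (the positives, on the diagonal) are pulled together, while other images
    # in the batch (negatives) are pushed apart. A simplified InfoNCE-style loss.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)   # (B, D)
    logits = z1 @ z2.t() / temperature                          # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def masked_reconstruction_loss(pred, target, mask):
    # Token-level objective: reconstruct only the masked patches.
    # pred, target: (B, N, P) per-patch values; mask: (B, N), True where masked.
    per_patch = ((pred - target) ** 2).mean(dim=-1)             # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

# Toy usage with random tensors standing in for a ViT's outputs:
z1, z2 = torch.randn(8, 256), torch.randn(8, 256)               # projected global features of two views
pred, target = torch.randn(8, 196, 768), torch.randn(8, 196, 768)
mask = torch.rand(8, 196) < 0.75                                # mask roughly 75% of the patches
print(contrastive_loss(z1, z2).item(), masked_reconstruction_loss(pred, target, mask).item())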
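The attention homogeneity described above can be quantified in several ways. One illustrative option, not necessarily the paper's exact measurement, is the mean pairwise cosine similarity of attention distributions across all heads and query tokens, sketched below under the assumption that the attention tensor has shape (B, H, N, N).

import torch
import torch.nn.functional as F

def attention_homogeneity(attn):
    # attn: softmax-normalized attention maps of shape (B, H, N, N).
    # Treat every (head, query token) pair as one attention distribution and
    # return the mean pairwise cosine similarity across all of them;
    # values close to 1 indicate collapsed, homogeneous attention.
    B, H, N, _ = attn.shape
    flat = F.normalize(attn.reshape(B, H * N, N), dim=-1)
    sim = flat @ flat.transpose(1, 2)                           # (B, H*N, H*N)
    off_diag = sim.sum(dim=(1, 2)) - sim.diagonal(dim1=1, dim2=2).sum(dim=-1)
    return off_diag / (H * N * (H * N - 1))

# Example: random attention maps for a ViT-B/16-like setting (12 heads, 197 tokens).
attn = torch.softmax(torch.randn(2, 12, 197, 197), dim=-1)
print(attention_homogeneity(attn))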
May 1, 2023