Projection Head is Secretly an Information Bottleneck

Zhuo Ouyang, Kaiwen Hu, Qi Zhang, Yifei Wang, Yisen Wang

arXiv.org Artificial Intelligence 

Recently, contrastive learning has risen to prominence as a paradigm for extracting meaningful data representations. Among its various design choices, adding a projection head on top of the encoder during training and removing it for downstream tasks has proven to significantly enhance the performance of contrastive learning. However, despite this empirical success, the underlying mechanism of the projection head remains under-explored. In this paper, we develop an in-depth theoretical understanding of the projection head from an information-theoretic perspective. By establishing theoretical guarantees on the downstream performance of the features before the projector, we reveal that an effective projector should act as an information bottleneck, filtering out information irrelevant to the contrastive objective. Building on these theoretical insights, we introduce modifications to the projector through training and structural regularizations. We believe our theoretical understanding of the role of the projection head will inspire more principled and advanced designs in this field.

In recent years, contrastive learning has emerged as a promising representation learning paradigm and has exhibited impressive performance without supervised labels (Chen et al., 2020; He et al., 2020; Zbontar et al., 2021). The core idea of contrastive learning is simple: pull the augmented views of the same sample (i.e., positive samples) together while pushing independent samples (i.e., negative samples) apart. To improve the downstream performance of contrastive learning, researchers have proposed various specialized training objectives and architectural designs (Grill et al., 2020; Wang et al., 2021; Guo et al., 2023; Wang et al., 2023; 2024; Du et al., 2024). Among them, one of the most widely used techniques is the projection head (i.e., projector) (Chen et al., 2020), a shallow network that follows the backbone during pretraining and is discarded in downstream tasks such as image classification and object detection. It has been shown that the features before the projector (denoted encoder features) exhibit much better downstream performance than the features after the projector (denoted projector features) across various applications (Jing et al., 2021; Gupta et al., 2022). Inspired by the success of the projection head in contrastive learning, researchers have also extended this architecture to other representation learning paradigms and achieved significant improvements (Sariyildiz et al., 2022; Zhou et al., 2021).

However, although the projection head has been widely adopted, the understanding of its underlying mechanism remains quite limited. In this paper, we aim to establish a theoretical analysis of the relationship between the projection head and the downstream performance of contrastive learning.
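To make the setup concrete, the following is a minimal PyTorch-style sketch of the design described above: a backbone encoder followed by a shallow projection head, trained with an InfoNCE/NT-Xent contrastive loss on two augmented views, where the projector is used only during pretraining and downstream tasks consume the encoder features. The module names, dimensions, and the `nt_xent` helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal SimCLR-style sketch (illustrative; not the authors' code).
# Encoder features h = f(x) are kept for downstream tasks; projector
# features z = g(h) are used only to compute the contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ContrastiveModel(nn.Module):
    def __init__(self, feat_dim=512, proj_dim=128):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # encoder f: outputs 512-d features
        self.encoder = backbone
        self.projector = nn.Sequential(      # shallow projection head g
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)                  # encoder features (used downstream)
        z = self.projector(h)                # projector features (used for the loss)
        return h, z

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent / InfoNCE loss: pull the two views of each sample together,
    push all other samples in the batch apart."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, d)
    sim = z @ z.t() / temperature                             # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))                     # drop self-similarity
    # The positive for index i is its other view: i+n (first half) or i-n (second half).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Pretraining step: the loss is computed on the projector features z.
model = ContrastiveModel()
x1, x2 = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)  # two augmented views
(_, z1), (_, z2) = model(x1), model(x2)
loss = nt_xent(z1, z2)

# Downstream usage: discard the projector and evaluate the encoder features h,
# e.g., with a linear probe (hypothetical 10-class task).
with torch.no_grad():
    h, _ = model(x1)
linear_probe = nn.Linear(512, 10)
logits = linear_probe(h)
```

Under the information-bottleneck view developed in the paper, the projector `g` in this sketch is where objective-specific information is absorbed, which is why the features `h` taken before it transfer better to downstream tasks.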