The Power of Contrast for Feature Learning: A Theoretical Analysis

Wenlong Ji, Zhun Deng, Ryumei Nakada, James Zou, Linjun Zhang

arXiv.org Machine Learning 

Deep supervised learning has achieved great success in various applications, including computer vision (Krizhevsky et al., 2012), natural language processing (Devlin et al., 2018), and scientific computing (Han et al., 2018). However, its dependence on manually assigned labels, which are usually difficult and costly to obtain, has motivated research into alternative approaches that exploit unlabeled data. Self-supervised learning is a promising approach that uses the unlabeled data itself as supervision and learns representations that are beneficial to potential downstream tasks. At a high level, there are two common approaches for feature extraction in self-supervised learning: generative and contrastive (Liu et al., 2021). Both aim to learn latent representations of the original data; the difference is that the generative approach focuses on minimizing the reconstruction error from the latent representations, whereas the contrastive approach aims to increase the similarity between representations of similar (positive) pairs while decreasing the similarity between representations of contrastive (negative) pairs. Recent works have shown the benefits of contrastive learning in practice (Chen et al., 2020a,b,c; He et al., 2020).
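To make the contrastive objective described above concrete, the following is a minimal sketch of an InfoNCE-style loss in NumPy, in the spirit of SimCLR (Chen et al., 2020a). The function name, batch size, and temperature are illustrative choices, not the specific formulation analyzed in this paper: each example contributes one positive pair (two augmented views) and treats all cross-view mismatches as negatives.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.5):
    """Toy InfoNCE-style contrastive loss on two batches of embeddings.

    z1[i] and z2[i] are representations of two augmented views of the
    same example (a positive pair); all other cross-pairs act as negatives.
    """
    # L2-normalize so the dot product equals cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)

    # Pairwise similarity matrix between the two views, scaled by temperature.
    logits = z1 @ z2.T / temperature              # shape (n, n)

    # For row i, the positive is the diagonal entry (i, i); the softmax
    # cross-entropy pulls positives together and pushes the off-diagonal
    # (negative) similarities apart.
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Example: random embeddings for a batch of 4 examples in R^8,
# with the second view taken as a small perturbation of the first.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 8))
z2 = z1 + 0.1 * rng.normal(size=(4, 8))
print(info_nce_loss(z1, z2))
```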