Memorization in Self-Supervised Learning Improves Downstream Generalization
Wang, Wenhao, Kaleem, Muhammad Ahmad, Dziedzic, Adam, Backes, Michael, Papernot, Nicolas, Boenisch, Franziska
Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data, often scraped from the internet. This data can still be sensitive, and empirical evidence suggests that SSL encoders memorize private information from their training data and can disclose it at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a framework for defining memorization within SSL. Our definition compares the alignment of representations of data points and their augmented views between encoders that were trained on these data points and encoders that were not. Through comprehensive empirical analysis across diverse encoder architectures and datasets, we show that even though SSL relies on large datasets and strong augmentations, both known in supervised learning as regularization techniques that reduce overfitting, significant fractions of training data points still experience high memorization. Our empirical results further show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks.

In recent years, self-supervised learning (SSL) has emerged as a potent new learning paradigm. SSL encoders can be trained without reliance on labeled data, which is often hard and expensive to obtain. Instead, SSL leverages large amounts of unlabeled data, often scraped from the internet, to obtain state-of-the-art performance in various domains, ranging from computer vision (He et al., 2022; Chen et al., 2020; Chen & He, 2021; Caron et al., 2021) to natural language processing (Devlin et al., 2018; Radford et al.). Empirical studies suggest that SSL encoders can disclose information about their training data at inference time (Meehan et al., 2023). Unintended revelation of private information is often associated with machine learning models' ability to memorize their training data (Zhang et al., 2016; Arpit et al., 2017; Chatterjee, 2018; Carlini et al., 2019; 2021; 2022). Moreover, in supervised learning, memorization was found to happen in the feature-extractor (encoder) layers (Feldman & Zhang, 2020; Maini et al., 2023), which are exactly the layers that SSL trains. Yet, given that SSL differs significantly from supervised learning in terms of learning objective, data processing, and augmentation strength, it remains unclear whether these trends from supervised learning transfer to the self-supervised setting.

Part of the work was done while the authors were at the University of Toronto and the Vector Institute.

[Figure: Distribution of memorization scores over training examples. Higher memorization scores indicate stronger memorization; outliers and atypical examples experience higher memorization than more standard samples.]
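The alignment-based definition sketched above can be made concrete in a few lines of code. The snippet below is a minimal illustration, not the authors' implementation: it assumes two pretrained encoders, one trained with the candidate point x (`f_with_x`) and one trained without it (`g_without_x`), an SSL augmentation pipeline `augment`, and l2 distance between representations as the (mis)alignment measure. All function and variable names are hypothetical.

```python
# Minimal sketch of an alignment-based memorization score, assuming:
# - encoders are callables mapping an input tensor to a representation vector,
# - `augment` is a stochastic SSL augmentation pipeline (e.g., crops, color jitter),
# - l2 distance between representations measures misalignment.
import torch

def alignment(encoder, x, augment, n_pairs=8):
    """Mean l2 distance between representations of augmented views of x.

    Lower values mean the encoder maps views of x closer together,
    i.e. better alignment.
    """
    dists = []
    with torch.no_grad():
        for _ in range(n_pairs):
            v1, v2 = augment(x), augment(x)   # two independent augmented views
            r1, r2 = encoder(v1), encoder(v2)
            dists.append(torch.norm(r1 - r2, dim=-1))
    return torch.stack(dists).mean()

def sslmem_score(f_with_x, g_without_x, x, augment):
    """Memorization of x: how much better the encoder trained on x (f)
    aligns x's augmented views than an encoder trained without x (g)."""
    return alignment(g_without_x, x, augment) - alignment(f_with_x, x, augment)
```

Averaging over several independently drawn augmentation pairs approximates the expectation over the augmentation distribution; a large positive score indicates that the encoder trained on x aligns its views much more tightly than an encoder that never saw x, i.e., that x is memorized.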
arXiv.org Artificial Intelligence
Jan-24-2024
- Country:
- North America > Canada > Ontario > Toronto (0.54)
- Genre:
  - Research Report > Experimental Study (0.66)
  - Research Report > New Finding (0.93)
- Industry:
- Information Technology > Security & Privacy (0.67)