Memory-Based Learning
Early-Learning Regularization Prevents Memorization of Noisy Labels
We propose a novel framework to perform classification via deep learning in the presence of noisy annotations. When trained on noisy labels, deep neural networks have been observed to first fit the training data with clean labels during an "early learning" phase, before eventually memorizing the examples with false labels. We prove that early learning and memorization are fundamental phenomena in high-dimensional classification tasks, even in simple linear models, and give a theoretical explanation in this setting. Motivated by these findings, we develop a new technique for noisy classification tasks, which exploits the progress of the early learning phase. In contrast with existing approaches, which use the model output during early learning to detect the examples with clean labels, and either ignore or attempt to correct the false labels, we take a different route and instead capitalize on early learning via regularization. There are two key elements to our approach. First, we leverage semi-supervised learning techniques to produce target probabilities based on the model outputs. Second, we design a regularization term that steers the model towards these targets, implicitly preventing memorization of the false labels. The resulting framework is shown to provide robustness to noisy annotations on several standard benchmarks and real-world datasets, where it achieves results comparable to the state of the art.
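To make the two elements above concrete, here is a minimal sketch of how such an approach could look in PyTorch: soft targets are maintained as a momentum average of the model's own output probabilities, and a regularization term penalizes predictions that drift away from those targets. The class name `ELRStyleLoss` and the hyperparameters `beta` (target momentum) and `lam` (regularization weight) are illustrative assumptions, not necessarily the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F


class ELRStyleLoss(torch.nn.Module):
    """Cross-entropy plus a regularizer that steers predictions toward
    running-average targets built from the model's own outputs
    (a sketch of the early-learning idea, not the paper's exact loss)."""

    def __init__(self, num_examples: int, num_classes: int,
                 beta: float = 0.7, lam: float = 3.0):
        super().__init__()
        # One soft-target probability vector per training example.
        self.register_buffer("targets", torch.zeros(num_examples, num_classes))
        self.beta = beta   # momentum for the target update
        self.lam = lam     # weight of the regularization term

    def forward(self, logits: torch.Tensor, labels: torch.Tensor,
                indices: torch.Tensor) -> torch.Tensor:
        probs = F.softmax(logits, dim=1).clamp(1e-4, 1.0 - 1e-4)
        # Update the targets with a momentum average of current predictions.
        with torch.no_grad():
            self.targets[indices] = (
                self.beta * self.targets[indices] + (1.0 - self.beta) * probs
            )
        ce = F.cross_entropy(logits, labels)
        # The regularizer decreases as predictions agree with the targets,
        # discouraging the drift toward memorizing false labels.
        reg = torch.log(1.0 - (self.targets[indices] * probs).sum(dim=1)).mean()
        return ce + self.lam * reg
```

In a training loop one would call `criterion(logits, noisy_labels, batch_indices)`, where `batch_indices` identifies each example so that its target can be accumulated across epochs.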
Early-Learning Regularization Prevents Memorization of Noisy Labels
In this section, we formalize and substantiate the claims of Theorem 1. Theorem 1 has three parts, which we address in the following sections. First, in Section A.2, we show that the classifier makes progress during the early-learning phase: over the first T iterations, the gradient is well correlated with v and the accuracy on mislabeled examples increases. However, as noted in the main text, this early progress halts because the gradient terms corresponding to correctly labeled examples begin to disappear. We prove this rigorously in Section A.3, which shows that the overall magnitude of the gradient terms corresponding to correctly labeled examples shrinks over the first T iterations. Finally, in Section A.4, we prove the claimed asymptotic behavior: as t → ∞, gradient descent perfectly memorizes the noisy labels.
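For orientation, the structure of the three claims can be sketched schematically as follows, with θ_t the parameters at iteration t, v the direction separating the true classes, and ỹ_i the (possibly noisy) label of example i; the constants, normalizations, and sign conventions are illustrative placeholders rather than the precise statement proved in the appendix.

```latex
% Schematic only: the constants and conditions below are placeholders,
% not the exact statement of Theorem 1.
\begin{align*}
\text{(i) early learning:} \quad
  & \left\langle \frac{-\nabla \mathcal{L}(\theta_t)}{\|\nabla \mathcal{L}(\theta_t)\|},\, v \right\rangle \ge c > 0
    \quad \text{for } 0 \le t \le T,\\
\text{(ii) vanishing clean terms:} \quad
  & \Bigl\| \sum_{i:\ \text{label of } i \text{ correct}} \nabla \ell_i(\theta_t) \Bigr\|
    \ \text{decreases over } 0 \le t \le T,\\
\text{(iii) memorization:} \quad
  & f_{\theta_t}(x_i) \to \tilde{y}_i \ \text{ as } t \to \infty
    \ \text{ for every training example } i.
\end{align*}
```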
Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
Despite their wide adoption, the underlying training and memorization dynamics of very large language models are not well understood. We empirically study exact memorization in causal and masked language modeling, across model sizes and throughout the training process. We measure the effects of dataset size, learning rate, and model size on memorization, finding that larger language models memorize training data faster across all settings. Surprisingly, we show that larger models can memorize a larger portion of the data before overfitting and tend to forget less throughout the training process. We also analyze the memorization dynamics of different parts of speech and find that models memorize nouns and numbers first; we hypothesize and provide empirical evidence that nouns and numbers act as a unique identifier for memorizing individual training examples. Together, these findings present another piece of the broader puzzle of trying to understand what actually improves as models get bigger.
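A rough sketch of how exact memorization of a sequence could be measured for a causal language model is shown below: the fraction of positions at which the model's greedy next-token prediction matches the original token. The use of GPT-2 via Hugging Face `transformers` and the simple per-token accuracy are illustrative assumptions, not the paper's exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: any causal LM works; GPT-2 is used here only as an example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def exact_memorization_fraction(text: str) -> float:
    """Fraction of tokens the model predicts exactly (greedy argmax)
    given the preceding context -- a simple per-sequence memorization proxy."""
    ids = tokenizer(text, return_tensors="pt").input_ids   # shape (1, L)
    with torch.no_grad():
        logits = model(ids).logits                          # shape (1, L, V)
    # The prediction at position t is made from the context up to t, so
    # compare logits[:, :-1] with the ground-truth tokens ids[:, 1:].
    preds = logits[:, :-1].argmax(dim=-1)
    matches = (preds == ids[:, 1:]).float()
    return matches.mean().item()


print(exact_memorization_fraction("The quick brown fox jumps over the lazy dog."))
```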
Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning
Prompt learning approaches have made waves in natural language processing by inducing better few-shot performance, but they still follow a parametric learning paradigm, in which forgetting and rote memorization can lead to unstable generalization. Specifically, vanilla prompt learning may struggle to utilize atypical instances by rote during fully-supervised training or overfit shallow patterns with low-shot data.
Measures of Information Reflect Memorization Patterns
Neural networks are known to exploit spurious artifacts (or shortcuts) that co-occur with a target label, exhibiting heuristic memorization. On the other hand, networks have been shown to memorize training examples, resulting in example-level memorization. These kinds of memorization impede generalization of networks beyond their training distributions. Detecting such memorization could be challenging, often requiring researchers to curate tailored test sets. In this work, we hypothesize, and subsequently show, that the diversity in the activation patterns of different neurons is reflective of model generalization and memorization. We quantify the diversity in the neural activations through information-theoretic measures and find support for our hypothesis in experiments spanning several natural language and vision tasks. Importantly, we discover that information organization points to the two forms of memorization, even for neural activations computed on unlabeled in-distribution examples. Lastly, we demonstrate the utility of our findings for the problem of model selection. The associated code and other resources for this work are available at https://information-measures.cs.technion.ac.il.
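As a simple illustration of how activation diversity might be quantified information-theoretically, the sketch below bins each neuron's activations over a batch of examples and computes the Shannon entropy of the resulting histogram; a low average entropy would indicate highly repetitive activation patterns. The binning scheme and this particular entropy estimator are assumptions for illustration, not necessarily the measures used in the paper.

```python
import numpy as np


def neuron_activation_entropy(activations: np.ndarray, n_bins: int = 20) -> np.ndarray:
    """Shannon entropy (in bits) of each neuron's activation histogram.

    activations: array of shape (n_examples, n_neurons), e.g. a hidden layer's
    outputs on unlabeled in-distribution examples.
    Returns an array of shape (n_neurons,).
    """
    n_examples, n_neurons = activations.shape
    entropies = np.zeros(n_neurons)
    for j in range(n_neurons):
        counts, _ = np.histogram(activations[:, j], bins=n_bins)
        probs = counts / counts.sum()
        probs = probs[probs > 0]
        entropies[j] = -(probs * np.log2(probs)).sum()
    return entropies


# Toy usage: random activations for 1000 examples and 64 neurons.
acts = np.random.randn(1000, 64)
print(neuron_activation_entropy(acts).mean())
```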
Emergent and Predictable Memorization in Large Language Models
Stella Biderman
Memorization, or the tendency of large language models (LLMs) to output entire sequences from their training data verbatim, is a key concern for deploying language models. In particular, it is vital to minimize a model's memorization of sensitive datapoints such as those containing personally identifiable information (PII). The prevalence of such undesirable memorization can pose issues for model trainers, and may even require discarding an otherwise functional model. We therefore seek to predict which sequences will be memorized before a large model's full train-time by extrapolating the memorization behavior of lower-compute trial runs. We measure memorization in the Pythia model suite and plot scaling laws for forecasting memorization, allowing us to provide equi-compute recommendations to maximize the reliability (recall) of such predictions. We additionally provide further novel discoveries on the distribution of memorization scores across models and data.
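One way to operationalize the prediction task described above is sketched below: sequences flagged as memorized by a low-compute trial run serve as predictions for the large model, and recall measures how many of the large model's memorized sequences were caught. The boolean per-sequence flags and this recall computation are generic assumptions, not the paper's exact pipeline.

```python
import numpy as np


def recall_of_trial_run(memorized_small: np.ndarray, memorized_large: np.ndarray) -> float:
    """Recall of predicting the large model's memorized sequences from a small
    trial run: of the sequences the large model memorizes, what fraction was
    already memorized (and hence flagged) by the small model?"""
    memorized_small = memorized_small.astype(bool)
    memorized_large = memorized_large.astype(bool)
    true_positives = np.logical_and(memorized_small, memorized_large).sum()
    positives = memorized_large.sum()
    return float(true_positives / positives) if positives > 0 else float("nan")


# Toy usage: memorization flags over 10 training sequences.
small = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
large = np.array([1, 1, 0, 1, 0, 1, 0, 1, 1, 0])
print(recall_of_trial_run(small, large))  # 4 of 6 memorized sequences caught -> ~0.67
```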
On Memorization in Probabilistic Deep Generative Models
Gerrit J.J. van den Burg, Christopher K.I. Williams
While experimenting with the proposed memorization score on CIFAR-10 [47], we noticed that the images of automobiles shown in Figure 6 are present in the training set multiple times (with slight variation). These works are recently proposed probabilistic generative models that achieve impressive performance on sample quality metrics such as the inception score (IS) [35] and the Fréchet inception distance (FID) [36], and also achieve high log likelihoods. However, the fact that we were able to serendipitously spot images from the training set in the generated samples might suggest that some unintended memorization occurs in these models. We do not know if there are other images in the presented samples that are present in the training data. Of course, spotting near duplicates of training observations is only possible because these models yield realistic samples.
Figure 6: Examples of images from the CIFAR-10 training set that were spotted in illustrations of samples from the model in recent work on generative models.
On Memorization in Probabilistic Deep Generative Models
Gerrit J.J. van den Burg, Christopher K.I. Williams
Recent advances in deep generative models have led to impressive results in a variety of application domains. Motivated by the possibility that deep learning models might memorize part of the input data, there have been increased efforts to understand how memorization arises. In this work, we extend a recently proposed measure of memorization for supervised learning (Feldman, 2019) to the unsupervised density estimation problem and adapt it to be more computationally efficient. Next, we present a study that demonstrates how memorization can occur in probabilistic deep generative models such as variational autoencoders. This reveals that the form of memorization to which these models are susceptible differs fundamentally from mode collapse and overfitting. Furthermore, we show that the proposed memorization score measures a phenomenon that is not captured by commonly-used nearest neighbor tests. Finally, we discuss several strategies that can be used to limit memorization in practice. Our work thus provides a framework for understanding problematic memorization in probabilistic generative models.
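To make the flavor of such a score concrete, the sketch below computes, for each example, the gap between its log-likelihood under models trained with and without that example, in the spirit of a leave-one-out memorization measure; in practice one would approximate this with models trained on random subsets rather than true leave-one-out retraining. The function name and the subset-based approximation are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np


def memorization_scores(log_liks: np.ndarray, in_train: np.ndarray) -> np.ndarray:
    """Leave-one-out-style memorization score per example.

    log_liks: shape (n_models, n_examples); log_liks[m, i] is the log-likelihood
              of example i under model m (each model trained on a random subset).
    in_train: boolean array of shape (n_models, n_examples); in_train[m, i] is
              True if example i was in model m's training subset.
    Returns, per example, the mean log-likelihood over models that trained on it
    minus the mean over models that did not.
    """
    scores = np.full(log_liks.shape[1], np.nan)
    for i in range(log_liks.shape[1]):
        with_i = log_liks[in_train[:, i], i]
        without_i = log_liks[~in_train[:, i], i]
        if len(with_i) > 0 and len(without_i) > 0:
            scores[i] = with_i.mean() - without_i.mean()
    return scores


# Toy usage: 4 subset-trained models, 3 examples.
ll = np.array([[-2.0, -3.1, -2.5],
               [-2.2, -3.0, -2.4],
               [-1.1, -3.2, -2.6],
               [-1.2, -2.9, -2.5]])
mask = np.array([[False, True, True],
                 [False, True, False],
                 [True, False, True],
                 [True, False, False]])
print(memorization_scores(ll, mask))
```

A large positive score indicates that an example is assigned much higher likelihood only when it is in the training set, which is the kind of unintended memorization the measure is meant to surface.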