Memory-Based Learning
Logarithmic Width Suffices for Robust Memorization
Amitsour Egosi, Gilad Yehudai, Ohad Shamir
The ability of neural networks to memorize labeled datasets is a central question in the study of their expressive power. Given some input domain X, output domain Y, and dataset size N, we say that a network memorizes datasets of size N, if for every labeled dataset D ⊆ X × Y, where |D| = N, we can find parameters such that the resulting network f: X → Y perfectly fits the dataset (that is, f(x) = y for every labeled pair (x, y) ∈ D). The main question here - which has been studied in many recent works (see Section 2 for details) - is to characterize the size/architecture of the networks that have enough expressive power to memorize any dataset of a given size N. However, merely fitting a given dataset is not enough for most tasks, and a desirable property for trained networks is that they remain robust to noise and minor modifications in the dataset. This robustness property allows neural networks to generalize from observed data points to unseen data points. Furthermore, neural networks have been shown to be vulnerable to adversarial attacks [Szegedy et al., 2013, Carlini and Wagner, 2017, Papernot et al., 2017, Athalye et al., 2018] in the form of slightly perturbed examples, where (in the context of visual data) the perturbation is often imperceptible to the human eye. Moreover, existing constructions of memorizing networks are often quite delicate, and not at all robust to such perturbations. This motivates the question of characterizing the networks that have enough capacity to robustly memorize a dataset.
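To make the definition concrete, the following is a minimal sketch (not the paper's construction): it fits a small ReLU network to a random binary-labeled dataset, then checks plain memorization (f(x) = y on every training pair) and a crude, sampled notion of robustness in a small ball around each point. The dataset size, width, and perturbation radius are all illustrative choices.

```python
# Minimal sketch: memorize a random labeled dataset with a small MLP, then probe
# robustness with random perturbations. Not the paper's analytical construction.
import torch
import torch.nn as nn

torch.manual_seed(0)
N, d, width = 50, 20, 128                      # dataset size, input dim, hidden width (illustrative)
X = torch.randn(N, d)
y = torch.randint(0, 2, (N,)).float()

net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(net(X).squeeze(1), y)
    loss.backward()
    opt.step()

pred = (net(X).squeeze(1) > 0).float()
print("memorized:", bool((pred == y).all()))

# Robustness probe: random perturbations of radius eps around each training point.
eps, trials = 0.05, 20
robust = True
for _ in range(trials):
    delta = torch.randn(N, d)
    delta = eps * delta / delta.norm(dim=1, keepdim=True)
    pred_pert = (net(X + delta).squeeze(1) > 0).float()
    robust &= bool((pred_pert == y).all())
print("robust on sampled perturbations:", robust)
```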
Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM
With the rapid development of Legal AI, much attention has been paid to one of its most important tasks, similar case retrieval, especially given the language models now available. In this paper, however, we try to improve the ranking performance of current models from the perspective of learning to rank rather than of the language models themselves. Specifically, we conduct experiments using a pairwise method, RankSVM, as the classifier in place of a fully connected layer, combined with commonly used language models, on the similar case retrieval datasets LeCaRDv1 and LeCaRDv2. We conclude that RankSVM generally helps improve retrieval performance on LeCaRDv1 and LeCaRDv2 compared with the original classifiers by optimizing the precise ranking, and that it also helps mitigate overfitting caused by class imbalance. Our code is available at https://github.com/liuyuqi123study/RankSVM_for_SLR
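The pairwise idea behind RankSVM can be sketched in a few lines. The snippet below is an illustrative sketch rather than the authors' implementation: it assumes each query-candidate pair has already been encoded into a feature vector (for instance, a pooled language-model embedding, which here is random), turns relevance-ordered pairs into difference vectors, and fits a linear SVM on them; candidates are then scored by a dot product with the learned weights.

```python
# Illustrative RankSVM sketch (not the authors' code): learn a linear scoring
# function from pairwise preferences via difference vectors.
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(features, relevance):
    """Build (x_i - x_j, +1) and (x_j - x_i, -1) examples for every pair
    where candidate i is more relevant than candidate j."""
    X_pairs, y_pairs = [], []
    for i in range(len(features)):
        for j in range(len(features)):
            if relevance[i] > relevance[j]:
                X_pairs.append(features[i] - features[j]); y_pairs.append(1)
                X_pairs.append(features[j] - features[i]); y_pairs.append(-1)
    return np.array(X_pairs), np.array(y_pairs)

rng = np.random.default_rng(0)
feats = rng.normal(size=(30, 64))        # stand-in for pooled LM embeddings of 30 candidates
rel = rng.integers(0, 3, size=30)        # graded relevance labels (0, 1, 2)

Xp, yp = pairwise_transform(feats, rel)
ranker = LinearSVC(C=1.0, max_iter=10000).fit(Xp, yp)

scores = feats @ ranker.coef_.ravel()    # higher score = ranked earlier
ranking = np.argsort(-scores)
print("top-5 candidates:", ranking[:5])
```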
Counterfactual Memorization in Neural Language Models
Chiyuan Zhang (Google Research), Daphne Ippolito (Carnegie Mellon University), Katherine Lee (Google DeepMind)
Modern neural language models that are widely used in various NLP tasks risk memorizing sensitive information from their training data. Understanding this memorization is important in real-world applications and also from a learning-theoretical perspective. An open question in previous studies of language model memorization is how to filter out "common" memorization. In fact, most memorization criteria strongly correlate with the number of occurrences in the training set, capturing memorized familiar phrases, public knowledge, templated texts, or other repeated data. We formulate a notion of counterfactual memorization which characterizes how a model's predictions change if a particular document is omitted during training. We identify and study counterfactually-memorized training examples in standard text datasets. We estimate the influence of each memorized training example on the validation set and on generated texts, showing how this can provide direct evidence of the source of memorization at test time.
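The structure of such an estimate can be illustrated with a toy stand-in. The sketch below uses a small classifier on synthetic data rather than a language model: it trains many models on random subsets of the training data and scores each example by the gap between its expected performance when included versus excluded, which mirrors the shape of a counterfactual memorization estimator.

```python
# Sketch of a counterfactual-memorization-style estimator (toy classifier, not a LM):
# score each example by in-subset vs. out-of-subset expected performance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
n, n_models, subset_frac = len(X), 40, 0.5

in_scores = [[] for _ in range(n)]    # per-example correctness when included in training
out_scores = [[] for _ in range(n)]   # per-example correctness when held out

for _ in range(n_models):
    idx = rng.choice(n, size=int(subset_frac * n), replace=False)
    mask = np.zeros(n, dtype=bool); mask[idx] = True
    model = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    correct = (model.predict(X) == y).astype(float)
    for i in range(n):
        (in_scores if mask[i] else out_scores)[i].append(correct[i])

counterfactual_mem = np.array([
    np.mean(in_scores[i]) - np.mean(out_scores[i]) for i in range(n)
])
print("most counterfactually memorized examples:", np.argsort(-counterfactual_mem)[:5])
```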
On Memorization in Probabilistic Deep Generative Models
Gerrit J.J. van den Burg, Christopher K.I. Williams
While experimenting with the proposed memorization score on CIFAR-10 [47], we noticed that the images of automobiles shown in Figure 6 are present in the training set multiple times (with slight variation). These works are recently proposed probabilistic generative models that achieve impressive performance on sample quality metrics such as the inception score (IS) [35] and the Fréchet inception distance (FID) [36], and also achieve high log likelihoods. However, the fact that we were able to serendipitously spot images from the training set in the generated samples might suggest that some unintended memorization is occurring. We do not know if there are other images in the presented samples that are present in the training data. Of course, spotting near duplicates of training observations is only possible because these models yield realistic samples.
Figure 6: Examples of images from the CIFAR-10 training set that were spotted in illustrations of samples from the model in recent work on generative models.
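Spotting near duplicates of this kind can also be partially automated with a simple nearest-neighbor check. The sketch below is purely illustrative (random arrays stand in for the training images and generated samples, and the threshold is arbitrary): it flags any sample whose nearest training image in pixel space falls below a distance threshold.

```python
# Illustrative near-duplicate check between generated samples and training images.
# Random arrays stand in for real data; distances and threshold are in pixel space.
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((1000, 32 * 32 * 3)).astype(np.float32)    # stand-in for CIFAR-10 images
samples = rng.random((20, 32 * 32 * 3)).astype(np.float32)    # stand-in for model samples

# Pairwise squared Euclidean distances via the ||a||^2 + ||b||^2 - 2 a.b expansion.
d2 = (samples ** 2).sum(1)[:, None] + (train ** 2).sum(1)[None, :] - 2.0 * samples @ train.T
d2 = np.maximum(d2, 0.0)
nearest = d2.argmin(axis=1)
dist = np.sqrt(d2[np.arange(len(samples)), nearest])

threshold = 1.0   # illustrative; in practice set relative to typical train-train distances
for i, (j, d) in enumerate(zip(nearest, dist)):
    if d < threshold:
        print(f"sample {i} is suspiciously close to training image {j} (distance {d:.3f})")
print("closest (sample, train) pair distance:", dist.min())
```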
On Memorization in Probabilistic Deep Generative Models
Gerrit J.J. van den Burg, Christopher K.I. Williams
Recent advances in deep generative models have led to impressive results in a variety of application domains. Motivated by the possibility that deep learning models might memorize part of the input data, there have been increased efforts to understand how memorization arises. In this work, we extend a recently proposed measure of memorization for supervised learning (Feldman, 2019) to the unsupervised density estimation problem and adapt it to be more computationally efficient. Next, we present a study that demonstrates how memorization can occur in probabilistic deep generative models such as variational autoencoders. This reveals that the form of memorization to which these models are susceptible differs fundamentally from mode collapse and overfitting. Furthermore, we show that the proposed memorization score measures a phenomenon that is not captured by commonly-used nearest neighbor tests. Finally, we discuss several strategies that can be used to limit memorization in practice. Our work thus provides a framework for understanding problematic memorization in probabilistic generative models.
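The leave-one-out intuition behind such a score can be sketched as follows; this is an illustration of the general idea rather than the paper's more efficient estimator, and a kernel density estimator stands in for a deep generative model: compare an example's log-likelihood under a model fit with it against a model fit without it.

```python
# Illustrative memorization score for density estimation: the difference in
# log-likelihood of x under models trained with vs. without x. A KDE stands in
# for a deep generative model; this is not the paper's efficient estimator.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 2))
data[0] = np.array([6.0, 6.0])   # an atypical point, likely to score as "memorized"

def mem_score(i, data, bandwidth=0.5):
    x = data[i:i + 1]
    kde_with = KernelDensity(bandwidth=bandwidth).fit(data)
    kde_without = KernelDensity(bandwidth=bandwidth).fit(np.delete(data, i, axis=0))
    return (kde_with.score_samples(x) - kde_without.score_samples(x)).item()

print("score for the atypical point:", mem_score(0, data))
print("score for a typical point:   ", mem_score(1, data))
```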
Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
Despite their wide adoption, the underlying training and memorization dynamics of very large language models are not well understood. We empirically study exact memorization in causal and masked language modeling, across model sizes and throughout the training process. We measure the effects of dataset size, learning rate, and model size on memorization, finding that larger language models memorize training data faster across all settings. Surprisingly, we show that larger models can memorize a larger portion of the data before over-fitting and tend to forget less throughout the training process. We also analyze the memorization dynamics of different parts of speech and find that models memorize nouns and numbers first; we hypothesize and provide empirical evidence that nouns and numbers act as unique identifiers for memorizing individual training examples. Together, these findings present another piece of the broader puzzle of trying to understand what actually improves as models get bigger.
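One simple way to operationalize exact memorization for a causal language model is sketched below; this is a hedged illustration rather than the paper's exact protocol, and the checkpoint and text are placeholders: teacher-force a training sequence through the model and report the fraction of positions where the argmax next-token prediction equals the true token.

```python
# Sketch: per-token "exact memorization" of a training string under a causal LM,
# as the fraction of positions where the argmax prediction matches the true next
# token. Model choice and text are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"  # illustrative; any causal LM checkpoint works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

text = "An example training sequence whose memorization we want to measure."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits              # shape (1, seq_len, vocab)

preds = logits[:, :-1].argmax(-1)           # prediction for token t+1 given the prefix
targets = ids[:, 1:]
exact_mem = (preds == targets).float().mean().item()
print(f"fraction of exactly predicted tokens: {exact_mem:.2f}")
```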
Emergent and Predictable Memorization in Large Language Models
Stella Biderman
Memorization, or the tendency of large language models (LLMs) to output entire sequences from their training data verbatim, is a key concern for deploying language models. In particular, it is vital to minimize a model's memorization of sensitive datapoints such as those containing personally identifiable information (PII). The prevalence of such undesirable memorization can pose issues for model trainers, and may even require discarding an otherwise functional model. We therefore seek to predict which sequences will be memorized before a large model is fully trained, by extrapolating the memorization behavior of lower-compute trial runs. We measure memorization in the Pythia model suite and plot scaling laws for forecasting memorization, allowing us to provide equi-compute recommendations to maximize the reliability (recall) of such predictions. We additionally provide further novel discoveries on the distribution of memorization scores across models and data.
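A prompted, verbatim-continuation measurement of this kind can be sketched as follows; the checkpoint, text, and prefix length are illustrative and this is not the paper's evaluation code: feed the model the first k tokens of a sequence, greedily decode k more, and report the fraction of decoded tokens that match the true continuation.

```python
# Sketch of a prompted verbatim-memorization check: prompt with the first k tokens
# of a sequence, greedily generate k more, and compare with the true continuation.
# Model, text, and k are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "EleutherAI/pythia-70m"   # illustrative small checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

sequence = "A candidate training sequence whose continuation we test for verbatim recall."
ids = tok(sequence, return_tensors="pt").input_ids
k = min(16, ids.shape[1] // 2)

prompt, continuation = ids[:, :k], ids[:, k:2 * k]
with torch.no_grad():
    out = model.generate(prompt, max_new_tokens=continuation.shape[1], do_sample=False)

generated = out[:, k:]
m = min(generated.shape[1], continuation.shape[1])   # guard against early stopping
score = (generated[:, :m] == continuation[:, :m]).float().mean().item()
print(f"memorization score (matched continuation tokens): {score:.2f}")
```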
Appendix of "Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning"
In this section, we introduce the datasets as shown in Table 1 and list the templates we use in the experiments as follows. The template is T(x) = [CLS] x It was [MASK]. We set the verbalizer (great/terrible) → (positive/negative) for SST-2, MR, and CR. For the Yahoo dataset, we assign the verbalizer following the original labels. FewNERD, SemEval, and TACRED are datasets for information extraction, which require inserting the entity into the template.
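The template and verbalizer above can be written down directly; the snippet below is a small sketch of this recipe (not the authors' code), with build_prompt and verbalizer as illustrative names.

```python
# Illustrative construction of the cloze template and verbalizer described above
# (a sketch of the general prompt-learning recipe, not the authors' code).
def build_prompt(x: str) -> str:
    # T(x) = [CLS] x It was [MASK].
    return f"[CLS] {x} It was [MASK]."

# Verbalizer for SST-2, MR, and CR: the [MASK] position is scored over these
# label words, and the winning word is mapped back to the class label.
verbalizer = {"great": "positive", "terrible": "negative"}

example = "The acting is superb and the story never drags."
print(build_prompt(example))
for word, label in verbalizer.items():
    print(f"if [MASK] -> '{word}', predict label '{label}'")
```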
Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning
Prompt learning approaches have made waves in natural language processing by inducing better few-shot performance, yet they still follow a parametric learning paradigm, in which the problems of oblivion and rote memorization can lead to unstable generalization. Specifically, vanilla prompt learning may struggle to utilize atypical instances, memorizing them by rote during fully-supervised training, or may overfit shallow patterns with low-shot data.
The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification
Been Kim, Cynthia Rudin, Julie A. Shah
We present the Bayesian Case Model (BCM), a general framework for Bayesian case-based reasoning (CBR) and prototype classification and clustering. BCM brings the intuitive power of CBR to a Bayesian generative framework. The BCM learns prototypes, the "quintessential" observations that best represent clusters in a dataset, by performing joint inference on cluster labels, prototypes and important features. Simultaneously, BCM pursues sparsity by learning subspaces, the sets of features that play important roles in the characterization of the prototypes. The prototype and subspace representation provides quantitative benefits in interpretability while preserving classification accuracy. Human subject experiments verify statistically significant improvements to participants' understanding when using explanations produced by BCM, compared to those given by prior art.