Rote Learning
Beyond Frequency: The Role of Redundancy in Large Language Model Memorization
Zhang, Jie, Zhao, Qinghua, Lin, Chi-ho, Kang, Zhongfeng, Li, Lei
Memorization in large language models poses critical risks for privacy and fairness as these systems scale to billions of parameters. While previous studies established correlations between memorization and factors such as token frequency and repetition patterns, we reveal distinct response patterns: increases in frequency have minimal impact on memorized samples (e.g., 0.09) while substantially affecting non-memorized samples (e.g., 0.25), a pattern that holds across model scales. Through counterfactual analysis, perturbing sample prefixes and quantifying perturbation strength via token positional changes, we demonstrate that redundancy correlates with memorization patterns. Our findings establish that about 79% of memorized samples are low-redundancy, that these low-redundancy samples exhibit 2-fold higher vulnerability than high-redundancy ones, and that, consequently, memorized samples drop by 0.6 under perturbation while non-memorized samples drop by only 0.01, indicating that more redundant content becomes both more memorable and more fragile. These findings suggest redundancy-guided approaches to data preprocessing that could reduce privacy risks and mitigate bias, helping ensure fairness in model deployments.
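A minimal sketch of the kind of measurement this abstract describes, not the paper's actual code: redundancy is approximated here with a gzip compression ratio, memorization with verbatim suffix reproduction, and the prefix perturbation with an adjacent-token swap; the proxy metric, function names, and the `model_generate` callable are all assumptions.

```python
# Minimal sketch (not the paper's code): approximate sample redundancy with a
# gzip compression ratio and test memorization via verbatim suffix reproduction
# under a small positional perturbation of the prefix.
import gzip

def redundancy(text: str) -> float:
    """Rough proxy: higher value -> more internal repetition in the sample."""
    raw = text.encode("utf-8")
    return 1.0 - len(gzip.compress(raw)) / max(len(raw), 1)

def is_memorized(model_generate, prefix: str, true_suffix: str) -> bool:
    """model_generate: any prompt -> continuation callable (assumed interface)."""
    return model_generate(prefix).startswith(true_suffix)

def perturb_prefix(prefix: str, swap_at: int) -> str:
    """Weak counterfactual perturbation: swap two adjacent tokens."""
    toks = prefix.split()
    if 0 < swap_at < len(toks):
        toks[swap_at - 1], toks[swap_at] = toks[swap_at], toks[swap_at - 1]
    return " ".join(toks)
```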
- North America > United States > Florida > Miami-Dade County > Miami (0.05)
- Asia > Singapore (0.04)
- Asia > Indonesia > Bali (0.04)
- (11 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
On the Edge of Memorization in Diffusion Models
Buchanan, Sam, Pai, Druv, Ma, Yi, De Bortoli, Valentin
When do diffusion models reproduce their training data, and when are they able to generate samples beyond it? A practically relevant theoretical understanding of this interplay between memorization and generalization may significantly impact real-world deployments of diffusion models with respect to issues such as copyright infringement and data privacy. In this work, to disentangle the different factors that influence memorization and generalization in practical diffusion models, we introduce a scientific and mathematical "laboratory" for investigating these phenomena in diffusion models trained on fully synthetic or natural image-like structured data. Within this setting, we hypothesize that the memorization or generalization behavior of an underparameterized trained model is determined by the difference in training loss between an associated memorizing model and a generalizing model. To probe this hypothesis, we theoretically characterize a crossover point wherein the weighted training loss of a fully generalizing model becomes greater than that of an underparameterized memorizing model at a critical value of model (under)parameterization. We then demonstrate via carefully-designed experiments that the location of this crossover predicts a phase transition in diffusion models trained via gradient descent, validating our hypothesis. Ultimately, our theory enables us to analytically predict the model size at which memorization becomes predominant. Our work provides an analytically tractable and practically meaningful setting for future theoretical and empirical investigations. Code for our experiments is available at https://github.com/DruvPai/diffusion_mem_gen.
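A toy numerical illustration of the crossover idea described above, using assumed functional forms for the two losses rather than the paper's derivation; the sample count, loss curves, and units are made up for the sketch.

```python
# Toy illustration of the crossover (assumed functional forms, not the paper's
# theory): find the model size at which an underparameterized memorizing
# model's training loss first drops below a fully generalizing model's loss.
import numpy as np

n_train = 1_000                           # training set size (made up)
sizes = np.arange(50, 5_000, 50)          # candidate model sizes (toy units)

loss_generalizing = np.full(sizes.shape, 0.30)        # flat, population-level fit
loss_memorizing = 0.05 + 0.5 * n_train / sizes        # improves with capacity

below = loss_memorizing < loss_generalizing
if below.any():
    print("memorization predicted to dominate above ~",
          sizes[np.argmax(below)], "parameters")
```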
- North America > United States > California (0.04)
- Asia > China > Hong Kong (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)
Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
Li, Huihan, Chen, You, Wang, Siyuan, He, Yixin, Mehrabi, Ninareh, Gupta, Rahul, Ren, Xiang
Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs are altered slightly, raising concerns about the extent to which their success relies on memorization. This issue is especially acute in Chain-of-Thought (CoT) reasoning, where spurious memorized patterns can trigger intermediate errors that cascade into incorrect final answers. We introduce STIM, a novel framework for Source-aware Token-level Identification of Memorization, which attributes each token in a reasoning chain to one of multiple memorization sources (local, mid-range, or long-range) based on its statistical co-occurrence with context in the pretraining corpus. Our token-level analysis across tasks and distributional settings reveals that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, accounting for up to 67% of wrong tokens. We also show that memorization scores from STIM can effectively predict which tokens are wrong within an erroneous reasoning step. STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured step-wise generation tasks.
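A hedged sketch of the flavor of "local" token-level scoring described above, not the released STIM framework: a PMI-style association between a token and its nearby context, computed from pretraining-corpus count tables; the table format, add-one smoothing, and window size are assumptions.

```python
# Hedged sketch of local memorization scoring in the spirit of STIM (not the
# released framework): pointwise mutual information between a token and its
# recent context, using assumed pretraining-corpus count tables.
from collections import Counter
from math import log

def local_memorization_score(token: str, context: list[str],
                             cooc: Counter, unigram: Counter,
                             window: int = 3) -> float:
    total = sum(unigram.values()) + 1
    scores = []
    for ctx in context[-window:]:
        p_joint = (cooc.get((ctx, token), 0) + 1) / total   # simplification: shared denominator
        p_ctx = (unigram.get(ctx, 0) + 1) / total
        p_tok = (unigram.get(token, 0) + 1) / total
        scores.append(log(p_joint / (p_ctx * p_tok)))        # PMI with add-one smoothing
    return sum(scores) / len(scores) if scores else 0.0
```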
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)
Private Memorization Editing: Turning Memorization into a Defense to Strengthen Data Privacy in Large Language Models
Ruzzetti, Elena Sofia, Xompero, Giancarlo A., Venditti, Davide, Zanzotto, Fabio Massimo
Large Language Models (LLMs) memorize, and thus, among huge amounts of uncontrolled data, may memorize Personally Identifiable Information (PII), which should not be stored and, consequently, should not be leaked. In this paper, we introduce Private Memorization Editing (PME), an approach for preventing private data leakage that turns an apparent limitation, the LLMs' memorization ability, into a powerful privacy defense strategy. Whereas attacks against LLMs exploit prior knowledge of their training data, our approach uses the same kind of knowledge to make a model more robust: we detect memorized PII and then mitigate its memorization by editing the model's knowledge of its training data. We verify that our procedure does not degrade the underlying language model while making it more robust against Training Data Extraction privacy attacks. We demonstrate that PME can effectively reduce the number of leaked PII items across a range of configurations, in some cases even reducing the accuracy of the privacy attacks to zero.
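An illustrative detection probe only, not PME's model-editing step (which is the paper's contribution): check whether the model completes a known training prefix with the PII it saw; the regex and the `generate` callable are assumptions for the sketch.

```python
# Illustrative PII-leakage probe (not PME itself): does the model reproduce a
# known PII string, or any email-like string, when prompted with its training
# prefix?
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def leaks_pii(generate, training_prefix: str, known_pii: str) -> bool:
    completion = generate(training_prefix)
    return known_pii in completion or bool(EMAIL.search(completion))
```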
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- (10 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)
Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models
Ramakrishnan, Badrinath, Balaji, Akshaya
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, but their tendency to memorize training data poses significant privacy risks, particularly during fine-tuning processes. This paper presents a comprehensive empirical analysis of data memorization in fine-tuned LLMs and introduces a novel multi-layered privacy protection framework. Through controlled experiments on modern LLM architectures including GPT-2, Phi-3, and Gemma-2, we demonstrate that fine-tuning with repeated sensitive data increases privacy leakage rates from baseline levels of 0-5% to 60-75%, representing a 64.2% average increase across tested models. We propose and rigorously evaluate four complementary privacy protection methods: semantic data deduplication, differential privacy during generation, entropy-based filtering, and pattern-based content filtering. Our experimental results show that these techniques can reduce data leakage to 0% while maintaining 94.7% of original model utility.
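Simplified stand-ins for two of the four defenses named in the abstract, pattern-based content filtering and entropy-based filtering; the patterns, thresholds, and the reading of "entropy-based" here are assumptions rather than the paper's settings.

```python
# Simplified stand-ins for two of the four defenses (assumed patterns and
# thresholds): pattern-based PII redaction and entropy-based output filtering.
import math
import re
from collections import Counter

PII_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # SSN-like
                re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")]    # email-like

def pattern_filter(text: str) -> str:
    for pat in PII_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text

def char_entropy(text: str) -> float:
    n = len(text) or 1
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

def passes_entropy_filter(text: str, min_bits: float = 3.0) -> bool:
    """Reject unusually low-entropy outputs, e.g. repetitive verbatim strings."""
    return char_entropy(text) >= min_bits
```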
- Research Report > New Finding (0.89)
- Research Report > Experimental Study (0.54)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.74)
From Teacher to Student: Tracking Memorization Through Model Distillation
Large language models (LLMs) are known to memorize parts of their training data, raising important concerns around privacy and security. While previous research has focused on studying memorization in pre-trained models, much less is known about how knowledge distillation (KD) affects memorization. In this study, we explore how different KD methods influence the memorization of fine-tuned task data when a large teacher model is distilled into smaller student variants. This study demonstrates that distilling a larger teacher model, fine-tuned on a dataset, into a smaller variant not only lowers computational costs and model size but also significantly reduces the memorization risks compared to standard fine-tuning approaches.
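A minimal sketch of how teacher and student memorization of fine-tuning data might be compared, assuming access to `generate` callables and the (prefix, suffix) training pairs; this is not the paper's evaluation protocol.

```python
# Minimal sketch: crude memorization rate as the fraction of training suffixes
# a model reproduces verbatim from their prefixes (assumed data format).
def memorization_rate(generate, examples: list[tuple[str, str]]) -> float:
    hits = sum(generate(prefix).startswith(suffix) for prefix, suffix in examples)
    return hits / len(examples) if examples else 0.0

# usage: compare memorization_rate(teacher_generate, data)
#        against memorization_rate(student_generate, data)
```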
- Asia > India > Uttarakhand > Roorkee (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Demystifying Foreground-Background Memorization in Diffusion Models
Di, Jimmy Z., Lu, Yiwei, Yu, Yaoliang, Kamath, Gautam, Dziedzic, Adam, Boenisch, Franziska
Diffusion models (DMs) memorize training images and can reproduce near-duplicates during generation. Current detection methods identify verbatim memorization but fail to capture two critical aspects: quantifying partial memorization occurring in small image regions, and memorization patterns beyond specific prompt-image pairs. To address these limitations, we propose Foreground Background Memorization (FB-Mem), a novel segmentation-based metric that classifies and quantifies memorized regions within generated images. Our method reveals that memorization is more pervasive than previously understood: (1) individual generations from single prompts may be linked to clusters of similar training images, revealing complex memorization patterns that extend beyond one-to-one correspondences; and (2) existing model-level mitigation methods, such as neuron deactivation and pruning, fail to eliminate local memorization, which persists particularly in foreground regions. Our work establishes an effective framework for measuring memorization in diffusion models, demonstrates the inadequacy of current mitigation approaches, and proposes a stronger mitigation method using a clustering approach.
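A rough sketch of region-wise comparison in the spirit of FB-Mem, not its actual segmentation-based metric: per-pixel error between a generated image and its nearest training image, summarized separately over an assumed foreground mask and the background.

```python
# Rough sketch (not FB-Mem's metric): summarize generation-vs-training error
# separately over foreground and background regions of an assumed mask.
import numpy as np

def region_similarity(gen: np.ndarray, train: np.ndarray,
                      fg_mask: np.ndarray) -> dict:
    """gen, train: HxWxC arrays in [0, 1]; fg_mask: HxW boolean foreground mask."""
    err = np.mean((gen - train) ** 2, axis=-1)          # per-pixel squared error

    def score(region_err: np.ndarray) -> float:
        # Higher score -> generation is closer to the training image in that region.
        return float(1.0 / (1.0 + region_err.mean())) if region_err.size else 0.0

    return {"foreground": score(err[fg_mask]), "background": score(err[~fg_mask])}
```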
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > Ontario (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.52)
- Information Technology > Artificial Intelligence > Natural Language > Generation (0.41)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.41)
Appendix of " Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning "
The prompt template is T(x) = [CLS] x It was [MASK]. A pre-trained language model (PLM) is used to extract the label-related words from the whole unlabeled training corpus. Hyper-parameters are reported in Table 2; most are the default values. The appendix further provides insight into the effect of β, k, and λ on the final results, notes that the model may require more retrieved references when there is no data for training, and leaves the engineering optimization of retrieval speed to future work.
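A hedged sketch of applying the template T(x) = [CLS] x It was [MASK]. with an off-the-shelf masked LM via Hugging Face transformers; the model choice is an assumption, and the tokenizer adds [CLS] itself, so only the textual part of the template is passed.

```python
# Hedged sketch: fill the [MASK] slot of the template with a masked LM.
# Model choice is an assumption; the tokenizer prepends [CLS] automatically.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def mask_predictions(x: str):
    """Return the top candidate label-related words at the [MASK] position."""
    return fill(f"{x} It was [MASK].", top_k=5)

print(mask_predictions("The acting felt wooden and the plot dragged."))
```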
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.41)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.32)