Memory-Based Learning
A title
Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the "source" and "target" modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. Finally, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.
For Better or for Worse, Transformers Seek Patterns for Memorization
Memorization in language models is a critical yet poorly understood phenomenon. In this work, we investigate memorization in transformer-based language models by analyzing their memorization dynamics during training over multiple epochs. We find that memorization is neither a constant accumulation of sequences nor simply dictated by the recency of exposure to these sequences. Instead, much like generalization, memorization appears to be driven by pattern recognition. Tracking memorization dynamics in mixed datasets, we observe that models memorize different sub-datasets in distinct bursts, suggesting that each subset is associated with unique underlying patterns, and that the model prefers to learn these patterns in a consistent order. We also find that easily learnable patterns tend to support generalization on unseen data, while more complex patterns do not. Furthermore, in datasets with weak or absent patterns, larger models may delay memorization relative to smaller ones, a behavior we term overthinking. Our results show that the subset of sequences memorized by a model over time is not arbitrary, and give insights into the internal processes a model goes through during training.
Impact of Layer Norm on Memorization and Generalization in Transformers
Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. Recently, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine labels. We further find that LayerNorm in the early layers is more critical than in the middle or later layers, and that its influence differs between Pre- and Post-LayerNorm models. We validate these findings on 13 models across 6 vision and language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.
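To make the architectural distinction concrete, the following is a minimal NumPy sketch of where LayerNorm sits in each variant. It is not taken from the paper: the names `layer_norm`, `pre_ln_block`, `post_ln_block`, and the toy `sublayer` are illustrative, and fixing `gamma = 1`, `beta = 0` is only one assumed reading of "eliminating LayerNorm parameters".

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row over the feature axis, then apply the learnable
    # scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def pre_ln_block(x, sublayer, gamma, beta):
    # Pre-LayerNorm: normalize the sublayer input; the residual path is untouched.
    return x + sublayer(layer_norm(x, gamma, beta))

def post_ln_block(x, sublayer, gamma, beta):
    # Post-LayerNorm: normalize after the residual addition.
    return layer_norm(x + sublayer(x), gamma, beta)

# Toy setup (hypothetical): a single tanh projection stands in for
# an attention or feed-forward sublayer.
rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d)) / np.sqrt(d)
sublayer = lambda h: np.tanh(h @ W)
x = rng.normal(size=(4, d))

# Assumption: gamma = 1 and beta = 0 mimic a LayerNorm without learnable parameters.
gamma, beta = np.ones(d), np.zeros(d)
pre_out = pre_ln_block(x, sublayer, gamma, beta)
post_out = post_ln_block(x, sublayer, gamma, beta)
```

In the Post-LayerNorm variant every block output is itself normalized, whereas in the Pre-LayerNorm variant the residual stream passes through unnormalized, which is one common explanation for the difference in gradient flow mentioned above.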
Analyzing the Power of Chain of Thought through Memorization Capabilities
It has been shown that the chain of thought (CoT) can enhance the power of large language models (LLMs) to solve certain mathematical reasoning problems. However, the capacity of CoT is still not fully explored. As an important instance, the following basic question has not yet been answered: Does CoT expand the capability of transformers across all reasoning tasks? We demonstrate that reasoning with transformers is essentially a memorization problem for reasoning datasets.
A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective
The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse--a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.
Memorization in Graph Neural Networks
Deep neural networks (DNNs) have been shown to memorize their training data, but similar analyses for graph neural networks (GNNs) remain under-explored. We introduce NCMemo (Node Classification Memorization), the first framework to quantify label memorization in semi-supervised node classification. We establish an inverse relationship between memorization and graph homophily, i.e., the tendency of connected nodes to share labels or features. Lower homophily significantly increases memorization, indicating that GNNs rely on label memorization when learning less homophilic graphs. We then analyze GNN training dynamics and find that increased memorization in low-homophily graphs is tightly coupled to GNNs' implicit bias toward using graph structure.
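As a concrete illustration of label homophily, the sketch below computes the edge homophily ratio, i.e., the fraction of edges whose endpoints share a label. This is one commonly used measure; the abstract does not state which homophily measure NCMemo relies on, so the function `edge_homophily` and the toy graphs are assumptions made for illustration.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    if not edges:
        raise ValueError("edge list must be non-empty")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

# A 4-cycle with three hypothetical labelings.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
h_high = edge_homophily(edges, [0, 0, 0, 0])   # every edge joins equal labels
h_mixed = edge_homophily(edges, [0, 0, 1, 1])  # two of the four edges agree
h_low = edge_homophily(edges, [0, 1, 0, 1])    # no edge joins equal labels
```

Under this measure, the abstract's claim would correspond to more label memorization on graphs resembling the last labeling than on graphs resembling the first.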
Prompt Tuning Transformers for Data Memorization
Prompt tuning has emerged as a powerful parameter-efficient fine-tuning technique, allowing large pretrained Transformers to adapt to downstream tasks by optimizing a small set of prompt embeddings. Despite its empirical success, the extent to which prompt tuning can memorize data remains poorly understood. In this paper, we provide both theoretical and empirical analyses of data memorization ability of prompt-tuned Transformers. Building on recent theoretical frameworks, we derive an upper bound on the required prompt length for exact memorization of finite datasets and establish a trade-off between prompt length and the number of autoregressive generation steps. Specifically, we show that a constant-size Transformer can memorize $n$ input-output pairs with prompts of length $\tilde{O}(\sqrt{nN})$, where $N$ denotes the sequence length. Empirical results further demonstrate that prompt-tuned, randomly initialized Transformers are able to effectively memorize finite datasets. These models also capture the intrinsic low-rank structure of the data, leading to a reduction in the required prompt length. Finally, we analyze how the initialization of the Transformer backbone affects the performance of prompt tuning. Our findings provide new insights into the expressivity, efficiency, and underlying mechanisms of prompt tuning, bridging theoretical memorization limits with observed empirical behaviors.
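To give a feel for how the stated bound scales, the sketch below evaluates $\sqrt{nN}$ for a few values. It deliberately ignores the constants and logarithmic factors hidden by the $\tilde{O}$ notation, so the numbers indicate growth rate only, not actual prompt lengths; the function name `prompt_length_scale` is hypothetical.

```python
import math

def prompt_length_scale(n, N):
    # Leading term of the bound: sqrt(n * N), where n is the number of
    # input-output pairs and N the sequence length.
    # Constants and log factors are omitted (an assumption of this sketch).
    return math.sqrt(n * N)

base = prompt_length_scale(10_000, 100)        # sqrt(1e6)
more_pairs = prompt_length_scale(40_000, 100)  # 4x the pairs
longer_seq = prompt_length_scale(10_000, 400)  # 4x the sequence length
```

Quadrupling either the number of pairs or the sequence length doubles the leading term, which is sublinear compared with the $nN$ tokens needed to write the dataset out in full.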
Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise
Highly over-parameterized models can simultaneously memorize noisy labels and generalize well, yet how these behaviors coexist remains poorly understood. In this work, we investigate the underlying mechanisms of this coexistence using modular arithmetic tasks under heavy label noise. Through extensive experiments on two-layer neural networks, we find that larger models tend to generalize better under appropriate optimization and model configurations, while noisy labels are memorized faster than clean data. Over-parameterized models internally form a generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Remarkably, even with 80\% label noise, near-perfect test accuracy can be achieved by extracting this internal structure using frequency-based methods. We further propose a task-agnostic method to partition networks into generalization and memorization components. Although this subnetwork improves generalization, it is limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and motivating the development of new tools to retrieve generalizable knowledge from over-parameterized networks.
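The experimental setting can be illustrated with a small data-generation sketch. It assumes modular addition as the task and corrupts a fixed fraction of labels by shifting them by a random nonzero offset; the abstract specifies neither the exact operation nor the noise model, so `make_noisy_modular_addition` and its parameters are hypothetical.

```python
import numpy as np

def make_noisy_modular_addition(p, noise_rate, seed=0):
    """Build all (a, b) pairs with label (a + b) mod p, then corrupt a fraction of labels."""
    rng = np.random.default_rng(seed)
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    inputs = np.stack([a.ravel(), b.ravel()], axis=1)
    clean = (inputs[:, 0] + inputs[:, 1]) % p

    n = len(clean)
    n_noisy = int(round(noise_rate * n))
    noisy_idx = rng.choice(n, size=n_noisy, replace=False)

    # Adding an offset in [1, p - 1] modulo p guarantees that a corrupted
    # label always differs from the clean one.
    offsets = rng.integers(1, p, size=n_noisy)
    noisy = clean.copy()
    noisy[noisy_idx] = (clean[noisy_idx] + offsets) % p
    return inputs, clean, noisy

# Example with a small modulus and 80% label noise.
inputs, clean, noisy = make_noisy_modular_addition(p=7, noise_rate=0.8)
observed_noise = float((noisy != clean).mean())
```

A model trained on `noisy` but evaluated against `clean` would separate the two behaviors discussed above: fitting the corrupted labels reflects memorization, while accuracy on the clean labels reflects the underlying generalizing structure.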