AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)

Neural Information Processing SystemsMay-27-2025, 03:48:06 GMT

Rethinking LLM Memorization through the Lens of Adversarial Compression

Large language models (LLMs) trained on web-scale datasets raise substantial concerns regarding permissible data usage. One major question is whether these models "memorize" all their training data or they integrate many data sources in some way more akin to how a human would learn and synthesize information. The answer hinges, to a large degree, on \emph{how we define memorization.} In this work, we propose the Adversarial Compression Ratio (ACR) as a metric for assessing memorization in LLMs. A given string from the training data is considered memorized if it can be elicited by a prompt (much) shorter than the string itself---in other words, if these strings can be compressed'' with the model by computing adversarial prompts of fewer tokens.

large language model, machine learning, natural language, (6 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)

arXiv.org Artificial IntelligenceMay-27-2025

Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization

Arvin, Chuck

As large language models (LLMs) continue to advance in capabilities, it is essential to assess how they perform on established benchmarks. In this study, we present a suite of experiments to assess the performance of modern LLMs (ranging from 3B to 90B+ parameters) on CaseHOLD, a legal benchmark dataset for identifying case holdings. Our experiments demonstrate scaling effects - performance on this task improves with model size, with more capable models like GPT4o and AmazonNovaPro achieving macro F1 scores of 0.744 and 0.720 respectively. These scores are competitive with the best published results on this dataset, and do not require any technically sophisticated model training, fine-tuning or few-shot prompting. To ensure that these strong results are not due to memorization of judicial opinions contained in the training data, we develop and utilize a novel citation anonymization test that preserves semantic meaning while ensuring case names and citations are fictitious. Models maintain strong performance under these conditions (macro F1 of 0.728), suggesting the performance is not due to rote memorization. These findings demonstrate both the promise and current limitations of LLMs for legal tasks with important implications for the development and measurement of automated legal analytics and legal benchmarks.

large language model, machine learning, natural language, (17 more...)

2505.02172

Country: North America > United States > California (0.46)

Genre: Research Report > New Finding (1.00)

Industry:

Law > Civil Rights & Constitutional Law (0.47)
Government > Regional Government > North America Government > United States Government (0.47)
Law > Litigation (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.47)

Neural Information Processing SystemsMay-26-2025, 21:03:09 GMT

Measuring Dejavu Memorization Efficiently

Recent research has shown that representation learning models may accidentally memorize their training data. For example, the déjà vu method shows that for certain representation learning models and training images, it is sometimes possible to correctly predict the foreground label given only the representation of he background – better than through dataset-level correlations. However, their measurement method requires training two models – one to estimate dataset-level correlations and the other to estimate memorization. This multiple model setup becomes infeasible for large open-source models. In this work, we propose alter native simple methods to estimate dataset-level correlations, and show that these can be used to approximate an off-the-shelf model's memorization ability without any retraining.

artificial intelligence, machine learning, measuring dejavu memorization efficiently, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)

Neural Information Processing SystemsMay-26-2025, 19:43:41 GMT

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subsets of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set.

large language model, machine learning, natural language, (5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.77)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.73)

Kozal, Jędrzej, Wasilewski, Jan, Ashrafee, Alif, Krawczyk, Bartosz, Woźniak, Michał

What is the role of memorization in Continual Learning?

arXiv.org Artificial IntelligenceMay-26-2025

Memorization impacts the performance of deep learning algorithms. Prior works have studied memorization primarily in the context of generalization and privacy. This work studies the memorization effect on incremental learning scenarios. Forgetting prevention and memorization seem similar. However, one should discuss their differences. We designed extensive experiments to evaluate the impact of memorization on continual learning. We clarified that learning examples with high memorization scores are forgotten faster than regular samples. Our findings also indicated that memorization is necessary to achieve the highest performance. However, at low memory regimes, forgetting regular samples is more important. We showed that the importance of a high-memorization score sample rises with an increase in the buffer size. We introduced a memorization proxy and employed it in the buffer policy problem to showcase how memorization could be used during incremental training. We demonstrated that including samples with a higher proxy memorization score is beneficial when the buffer size is large.

artificial intelligence, machine learning, memorization score, (16 more...)

2505.17664

Country:

North America > United States (0.14)
Europe > Poland (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology > Security & Privacy (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)

Selvam, Sriram, Ghosh, Anneswa

PANORAMA: A synthetic PII-laced dataset for studying sensitive data memorization in LLMs

arXiv.org Artificial IntelligenceMay-20-2025

The memorization of sensitive and personally identifiable information (PII) by large language models (LLMs) poses growing privacy risks as models scale and are increasingly deployed in real-world applications. Existing efforts to study sensitive and PII data memorization and develop mitigation strategies are hampered by the absence of comprehensive, realistic, and ethically sourced datasets reflecting the diversity of sensitive information found on the web. We introduce PANORAMA - Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis, a large-scale synthetic corpus of 384,789 samples derived from 9,674 synthetic profiles designed to closely emulate the distribution, variety, and context of PII and sensitive data as it naturally occurs in online environments. Our data generation pipeline begins with the construction of internally consistent, multi-attribute human profiles using constrained selection to reflect real-world demographics such as education, health attributes, financial status, etc. Using a combination of zero-shot prompting and OpenAI o3-mini, we generate diverse content types - including wiki-style articles, social media posts, forum discussions, online reviews, comments, and marketplace listings - each embedding realistic, contextually appropriate PII and other sensitive information. We validate the utility of PANORAMA by fine-tuning the Mistral-7B model on 1x, 5x, 10x, and 25x data replication rates with a subset of data and measure PII memorization rates - revealing not only consistent increases with repetition but also variation across content types, highlighting PANORAMA's ability to model how memorization risks differ by context. Our dataset and code are publicly available, providing a much-needed resource for privacy risk assessment, model auditing, and the development of privacy-preserving LLMs.

large language model, machine learning, natural language, (18 more...)

2505.12238

Country: Europe (0.46)

Genre: Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)

He, Chengyang, Liu, Xu, Camps, Gadiel Sznaier, Sartoretti, Guillaume, Schwager, Mac

Demystifying Diffusion Policies: Action Memorization and Simple Lookup Table Alternatives

arXiv.org Artificial IntelligenceMay-12-2025

Diffusion policies have demonstrated remarkable dexterity and robustness in intricate, high-dimensional robot manipulation tasks, while training from a small number of demonstrations. However, the reason for this performance remains a mystery. In this paper, we offer a surprising hypothesis: diffusion policies essentially memorize an action lookup table -- and this is beneficial. We posit that, at runtime, diffusion policies find the closest training image to the test image in a latent space, and recall the associated training action sequence, offering reactivity without the need for action generalization. This is effective in the sparse data regime, where there is not enough data density for the model to learn action generalization. We support this claim with systematic empirical evidence. Even when conditioned on wildly out of distribution (OOD) images of cats and dogs, the Diffusion Policy still outputs an action sequence from the training data. With this insight, we propose a simple policy, the Action Lookup Table (ALT), as a lightweight alternative to the Diffusion Policy. Our ALT policy uses a contrastive image encoder as a hash function to index the closest corresponding training action sequence, explicitly performing the computation that the Diffusion Policy implicitly learns. We show empirically that for relatively small datasets, ALT matches the performance of a diffusion model, while requiring only 0.0034 of the inference time and 0.0085 of the memory footprint, allowing for much faster closed-loop inference with resource constrained robots. We also train our ALT policy to give an explicit OOD flag when the distance between the runtime image is too far in the latent space from the training images, giving a simple but effective runtime monitor. More information can be found at: https://stanfordmsl.github.io/alt/.

artificial intelligence, diffusion model, machine learning, (16 more...)

2505.05787

Country:

Asia (0.28)
North America > United States > California (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.43)

arXiv.org Machine LearningMay-7-2025

Resolving Memorization in Empirical Diffusion Model for Manifold Data in High-Dimensional Spaces

Lyu, Yang, Qian, Yuchun, Nguyen, Tan Minh, Tong, Xin T.

Diffusion models is a popular computational tool to generate new data samples. It utilizes a forward diffusion process that add noise to the data distribution and then use a reverse process to remove noises to produce samples from the data distribution. However, when the empirical data distribution consists of $n$ data point, using the empirical diffusion model will necessarily produce one of the existing data points. This is often referred to as the memorization effect, which is usually resolved by sophisticated machine learning procedures in the current literature. This work shows that the memorization problem can be resolved by a simple inertia update step at the end of the empirical diffusion model simulation. Our inertial diffusion model requires only the empirical diffusion model score function and it does not require any further training. We show that choosing the inertia diffusion model sample distribution is an $O\left(n^{-\frac{2}{d+4}}\right)$ Wasserstein-1 approximation of a data distribution lying on a $C^2$ manifold of dimension $d$. Since this estimate is significant smaller the Wasserstein1 distance between population and empirical distributions, it rigorously shows the inertial diffusion model produces new data samples. Remarkably, this upper bound is completely free of the ambient space dimension, since there is no training involved. Our analysis utilizes the fact that the inertial diffusion model samples are approximately distributed as the Gaussian kernel density estimator on the manifold. This reveals an interesting connection between diffusion model and manifold learning.

artificial intelligence, kde, machine learning, (16 more...)

arXiv.org Machine Learning

2505.02508

Country:

Asia > Singapore (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Djiré, Albérick Euraste, Kaboré, Abdoul Kader, Barr, Earl T., Klein, Jacques, Bissyandé, Tegawendé F.

Memorization or Interpolation ? Detecting LLM Memorization through Input Perturbation Analysis

arXiv.org Artificial IntelligenceMay-7-2025

While Large Language Models (LLMs) achieve remarkable performance through training on massive datasets, they can exhibit concerning behaviors such as verbatim reproduction of training data rather than true generalization. This memorization phenomenon raises significant concerns about data privacy, intellectual property rights, and the reliability of model evaluations. This paper introduces PEARL, a novel approach for detecting memorization in LLMs. PEARL assesses how sensitive an LLM's performance is to input perturbations, enabling memorization detection without requiring access to the model's internals. We investigate how input perturbations affect the consistency of outputs, enabling us to distinguish between true generalization and memorization. Our findings, following extensive experiments on the Pythia open model, provide a robust framework for identifying when the model simply regurgitates learned information. Applied on the GPT 4o models, the PEARL framework not only identified cases of memorization of classic texts from the Bible or common code from HumanEval but also demonstrated that it can provide supporting evidence that some data, such as from the New York Times news articles, were likely part of the training data of a given model.

large language model, machine learning, natural language, (17 more...)

2505.03019

Country:

Europe (1.00)
Asia (0.93)
North America > United States (0.93)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Government > Regional Government > Europe Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)