AITopics

2502.05087

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Switzerland (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Education > Curriculum > Subject-Specific Education (0.67)
Health & Medicine > Health Care Technology > Medical Record (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)

arXiv.org Machine LearningFeb-5-2025

Taking a Big Step: Large Learning Rates in Denoising Score Matching Prevent Memorization

Wu, Yu-Han, Marion, Pierre, Biau, Gérard, Boyer, Claire

Denoising score matching plays a pivotal role in the performance of diffusion-based generative models. However, the empirical optimal score--the exact solution to the denoising score matching--leads to memorization, where generated samples replicate the training data. Yet, in practice, only a moderate degree of memorization is observed, even without explicit regularization. In this paper, we investigate this phenomenon by uncovering an implicit regularization mechanism driven by large learning rates. Specifically, we show that in the small-noise regime, the empirical optimal score exhibits high irregularity. We then prove that, when trained by stochastic gradient descent with a large enough learning rate, neural networks cannot stably converge to a local minimum with arbitrarily small excess risk. Consequently, the learned score cannot be arbitrarily close to the empirical optimal score, thereby mitigating memorization. To make the analysis tractable, we consider one-dimensional data and two-layer neural networks. Experiments validate the crucial role of the learning rate in preventing memorization, even beyond the one-dimensional setting.

artificial intelligence, machine learning, memorization, (13 more...)

arXiv.org Machine Learning

2502.03435

Country:

Europe > France (0.04)
North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)

arXiv.org Artificial IntelligenceFeb-3-2025

Skewed Memorization in Large Language Models: Quantification and Decomposition

Li, Hao, Huang, Di, Wang, Ziyu, Rahmani, Amir M.

Memorization in Large Language Models (LLMs) poses privacy and security risks, as models may unintentionally reproduce sensitive or copyrighted data. Existing analyses focus on average-case scenarios, often neglecting the highly skewed distribution of memorization. This paper examines memorization in LLM supervised fine-tuning (SFT), exploring its relationships with training duration, dataset size, and inter-sample similarity. By analyzing memorization probabilities over sequence lengths, we link this skewness to the token generation process, offering insights for estimating memorization and comparing it to established metrics. Through theoretical analysis and empirical evaluation, we provide a comprehensive understanding of memorization behaviors and propose strategies to detect and mitigate risks, contributing to more privacy-preserving LLMs.

large language model, machine learning, natural language, (17 more...)

2502.01187

Country: North America > United States > California > Orange County > Irvine (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Security & Privacy (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)

Dankers, Verna, Raunak, Vikas

Memorization Inheritance in Sequence-Level Knowledge Distillation for Neural Machine Translation

arXiv.org Artificial IntelligenceFeb-3-2025

In this work, we explore how instance-level memorization in the teacher Neural Machine Translation (NMT) model gets inherited by the student model in sequence-level knowledge distillation (SeqKD). We find that despite not directly seeing the original training data, students memorize more than baseline models (models of the same size, trained on the original data) -- 3.4% for exact matches and 57% for extractive memorization -- and show increased hallucination rates. Further, under this SeqKD setting, we also characterize how students behave on specific training data subgroups, such as subgroups with low quality and specific counterfactual memorization (CM) scores, and find that students exhibit amplified denoising on low-quality subgroups. Finally, we propose a modification to SeqKD named Adaptive-SeqKD, which intervenes in SeqKD to reduce memorization and hallucinations. Overall, we recommend caution when applying SeqKD: students inherit both their teachers' superior performance and their fault modes, thereby requiring active monitoring.

artificial intelligence, machine learning, natural language, (16 more...)

2502.01491

Country:

North America > Trinidad and Tobago (0.04)
North America > Canada (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
(5 more...)

Genre: Research Report (0.50)

Industry: Education (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)

Neural Information Processing SystemsJan-27-2025, 12:50:16 GMT

Reviews: Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

The paper investigates the problem of expressiveness in neural networks w.r.t. The authors also show an upper bound for classification, a corollary of which is that a three hidden layer network with hidden layers of sized 2k-2k-4k can perfectly classify ImageNet. Moreover, they show that if the overall sum of hidden nodes in a ResNet is of order N/d_x, where d_x is the input dimension then again the network can perfectly realize the data. Lastly, an analysis is given showing batch SGD that is initialized close to a global minimum will come close to a point with value significantly smaller than the loss in the initialization (though a convergence guarantee could not be given). The paper is clear and easy to follow for the most part, and conveys a feeling that the authors did their best to make the analysis as thorough and exhausting as possible, providing results for various settings.

memorization capacity, powerful memorizer, small relu network, (2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.40)

Neural Information Processing SystemsJan-27-2025, 12:50:07 GMT

Reviews: Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

The topic is timely, and the results would be of interest to a wide audience. The reviewers found the paper well written and were also satisfied with the authors response. However, please do take the time to address their comments and revise what is necessary in the final version.

memorization capacity, powerful memorizer, small relu network, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.57)

Baptista, Ricardo, Dasgupta, Agnimitra, Kovachki, Nikola B., Oberai, Assad, Stuart, Andrew M.

Memorization and Regularization in Generative Diffusion Models

arXiv.org Artificial IntelligenceJan-27-2025

Diffusion models have emerged as a powerful framework for generative modeling. At the heart of the methodology is score matching: learning gradients of families of log-densities for noisy versions of the data distribution at different scales. When the loss function adopted in score matching is evaluated using empirical data, rather than the population loss, the minimizer corresponds to the score of a time-dependent Gaussian mixture. However, use of this analytically tractable minimizer leads to data memorization: in both unconditioned and conditioned settings, the generative model returns the training samples. This paper contains an analysis of the dynamical mechanism underlying memorization. The analysis highlights the need for regularization to avoid reproducing the analytically tractable minimizer; and, in so doing, lays the foundations for a principled understanding of how to regularize. Numerical experiments investigate the properties of: (i) Tikhonov regularization; (ii) regularization designed to promote asymptotic consistency; and (iii) regularizations induced by under-parameterization of a neural network or by early stopping when training a neural network. These experiments are evaluated in the context of memorization, and directions for future development of regularization are highlighted.

artificial intelligence, machine learning, memorization and regularization, (16 more...)

2501.15785

Country:

North America > United States > California (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.63)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)

Neural Information Processing SystemsJan-25-2025, 04:36:36 GMT

Review for NeurIPS paper: Neural Networks Learning and Memorization with (almost) no Over-Parameterization

Weaknesses: One of my concerns is the rigorousness of the paper. A key lemma, namely Lemma 12 in the supplementary material is only given with a proof sketch. Moreover, in the proof sketch, how the authors handle the general M-decent activation functions is discussed very ambiguously. This makes the results for ReLU activation function particularly questionable. The significance and novelty of this paper compared with the existing results are also not fully demonstrated. It is claimed in this paper that a tight analysis is given on the convergence of NTK to its expectations.

neural network learning and memorization, neurips paper, over-parameterization, (4 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.40)

Neural Information Processing SystemsJan-25-2025, 04:36:28 GMT

Review for NeurIPS paper: Neural Networks Learning and Memorization with (almost) no Over-Parameterization

This paper studies optimization in the NTK regime, further improving the best prior width bounds for random data (I believe Oymak-Soltanolkotabi were the prior best). The reviewers and I were all favorable, and I look forward to seeing this paper appear, and support the authors in further investigations. Relatedly, this point was not sufficiently handled in the rebuttal, despite the rebuttal using less than half a page. Please consider such things in the future. One is by Roman Vershynin, and I believe Sebastien Bubeck and colleagues also had a paper on the "Baum" problem.

neural network learning and memorization, neurips paper, over-parameterization, (1 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.40)

Ketha, Simran, Ramaswamy, Venkatakrishnan

Decoding Generalization from Memorization in Deep Neural Networks

arXiv.org Artificial IntelligenceJan-24-2025

Overparameterized Deep Neural Networks that generalize well have been key to the dramatic success of Deep Learning in recent years. The reasons for their remarkable ability to generalize are not well understood yet. It has also been known that deep networks possess the ability to memorize training data, as evidenced by perfect or high training accuracies on models trained with corrupted data that have class labels shuffled to varying degrees. Concomitantly, such models are known to generalize poorly, i.e. they suffer from poor test accuracies, due to which it is thought that the act of memorizing substantially degrades the ability to generalize. It has, however, been unclear why the poor generalization that accompanies such memorization, comes about. One possibility is that in the process of training with corrupted data, the layers of the network irretrievably reorganize their representations in a manner that makes generalization difficult. The other possibility is that the network retains significant ability to generalize, but the trained network somehow chooses to readout in a manner that is detrimental to generalization. Here, we provide evidence for the latter possibility by demonstrating, empirically, that such models possess information in their representations for substantially improved generalization, even in the face of memorization. Furthermore, such generalization abilities can be easily decoded from the internals of the trained model, and we build a technique to do so from the outputs of specific layers of the network. We demonstrate results on multiple models trained with a number of standard datasets.

accuracy, artificial intelligence, machine learning, (11 more...)

2501.14687

Country: Asia > India (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.81)