Sequence-Level Analysis of Leakage Risk of Training Data in Large Language Models

Dec-15-2024–arXiv.org Artificial Intelligence

This work advocates for the use of sequence-level probabilities for quantifying the risk of extracting training data from Large Language Models (LLMs), as they provide much finer-grained information than has been previously obtained. We re-analyze the effects of decoding schemes, model size, prefix length, partial sequence leakages, and token positions to uncover new insights that were not possible in prior work due to its choice of metrics. We perform this study on two pre-trained models, LLaMa and OPT, trained on the Common Crawl and Pile respectively. We discover that extraction rate, the predominant metric used in prior quantification work, underestimates the threat of leakage of training data in randomized LLMs by as much as 2.14X. The insights gained from our analysis show that it is important to look at leakage of training data on a per-sequence basis.

There have been several studies documenting leakage of training data from Large Language Models (LLMs), starting with Carlini et al. (2021).
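
To make the contrast between a single aggregate extraction rate and a per-sequence view concrete, here is a minimal Python sketch. It is not the authors' code or their exact metric. It assumes that, for each training sequence, we already have the model's per-token probabilities for the true suffix given its prefix, and that under randomized decoding independent samples are drawn. The function names and the numbers in `suffix_token_probs` are hypothetical and chosen only for illustration.

```python
import math

def sequence_log_prob(token_log_probs):
    """Log-probability of emitting the whole suffix: sum of per-token log-probs."""
    return sum(token_log_probs)

def prob_leaked_in_n_samples(seq_prob, n):
    """Chance the exact suffix appears at least once in n independent samples."""
    return 1.0 - (1.0 - seq_prob) ** n

def extraction_rate(extracted_flags):
    """Aggregate metric: fraction of sequences reproduced by one fixed decoding run."""
    return sum(extracted_flags) / len(extracted_flags)

# Hypothetical per-token probabilities of the true suffix for three sequences.
suffix_token_probs = {
    "seq_a": [0.95, 0.90, 0.97],
    "seq_b": [0.45, 0.92, 0.88],
    "seq_c": [0.10, 0.30, 0.20],
}
# Hypothetical outcome of a single decoding run: only seq_a was reproduced.
extracted = [True, False, False]

print("extraction rate:", extraction_rate(extracted))
for name, probs in suffix_token_probs.items():
    log_p = sequence_log_prob([math.log(p) for p in probs])
    p = math.exp(log_p)
    # Per-sequence risk after 1 and after 10 sampled generations.
    print(name, round(p, 4), round(prob_leaked_in_n_samples(p, 10), 4))
```

In this toy setup the single-run extraction rate counts `seq_b` as safe, while its sequence-level probability shows it would still be emitted fairly often under repeated sampling. That kind of per-sequence information is what an aggregate rate cannot show.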

large language model, machine learning, natural language

Country:
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)

Genre:
- Research Report (0.64)

Industry:
- Information Technology > Security & Privacy (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning (1.00)
