Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit

Freeman, Joshua, Rippe, Chloe, Debenedetti, Edoardo, Andriushchenko, Maksym

Dec-9-2024–arXiv.org Artificial Intelligence

Our work aims to measure the propensity of OpenAI's LLMs to exhibit verbatim memorization in their outputs relative to other LLMs, focusing specifically on news articles. We find that both GPT and Claude models use refusal training and output filters to prevent verbatim output of memorized articles. We apply a basic prompt template to bypass the refusal training and show that OpenAI models are currently less prone to memorization elicitation than models from Meta, Mistral, and Anthropic. We also find that as models increase in size, especially beyond 100 billion parameters, they demonstrate significantly greater capacity for memorization. Our findings have practical implications for training: more attention must be paid to preventing verbatim memorization in very large models.
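To make "verbatim memorization" concrete, one simple way to score it is the longest contiguous run of words that a model output shares with the source article. The sketch below is illustrative only: the function names `longest_common_word_run` and `verbatim_fraction` are hypothetical, whitespace tokenization is an assumption, and this is not necessarily the metric used in the paper.

```python
def longest_common_word_run(reference: str, output: str) -> int:
    """Length, in words, of the longest contiguous word sequence shared by both texts.

    Assumption: words are compared exactly after whitespace splitting,
    with no normalization of case or punctuation.
    """
    ref = reference.split()
    out = output.split()
    best = 0
    # prev[j] is the length of the common run ending at ref[i-2] and out[j-1].
    prev = [0] * (len(out) + 1)
    for i in range(1, len(ref) + 1):
        curr = [0] * (len(out) + 1)
        for j in range(1, len(out) + 1):
            if ref[i - 1] == out[j - 1]:
                # Extend the run that ended at the previous word pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def verbatim_fraction(reference: str, output: str) -> float:
    """Longest shared run as a fraction of the reference length (0.0 if empty)."""
    ref_len = len(reference.split())
    if ref_len == 0:
        return 0.0
    return longest_common_word_run(reference, output) / ref_len


if __name__ == "__main__":
    article = "the quick brown fox jumps over the lazy dog"
    completion = "he said the quick brown fox ran away"
    print(longest_common_word_run(article, completion))
    print(verbatim_fraction(article, completion))
```

A score like this only captures exact contiguous copying; paraphrased or lightly edited reproductions would need a fuzzier measure.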

large language model, machine learning, natural language

Country:
- Europe (1.00)
- North America > United States (0.93)

Genre:
- Research Report > New Finding (0.88)

Industry:
- Information Technology > Security & Privacy (1.00)
- Law
  - Intellectual Property & Technology Law (1.00)
  - Litigation (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning
    - Memory-Based Learning > Rote Learning (1.00)
    - Neural Networks > Deep Learning
      - Generative AI (0.94)
  - Natural Language
    - Chatbot (1.00)
    - Large Language Model (1.00)