Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

Chang, Kent K., Cramer, Mackenzie, Soni, Sandeep, Bamman, David

Oct-20-2023–arXiv.org Artificial Intelligence

In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known.

computational linguistic, language model, memorization

Country:
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America
  - Dominican Republic (0.04)
  - Bermuda (0.04)
  - United States
    - Texas (0.04)
    - Washington > King County
      - Seattle (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Illinois > Cook County
      - Chicago (0.04)
    - California > Alameda County
      - Berkeley (0.04)
  - Puerto Rico > Peñuelas
    - Peñuelas (0.04)
- Europe
  - Russia (0.04)
  - France (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
- Asia
  - Russia (0.04)
  - Middle East > UAE
    - Abu Dhabi Emirate > Abu Dhabi (0.04)

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.68)

Industry:
- Leisure & Entertainment (1.00)
- Media > Film (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
