Interrogating LLM design under a fair learning doctrine

Wei, Johnny Tian-Zheng, Wang, Maggie, Godbole, Ameya, Choi, Jonathan H., Jia, Robin

Feb-22-2025 – arXiv.org Artificial Intelligence

The current discourse on large language models (LLMs) and copyright largely takes a "behavioral" perspective, focusing on model outputs and evaluating whether they are substantially similar to training data. However, substantial similarity is difficult to define algorithmically, and a narrow focus on model outputs is insufficient to address all copyright risks. In this interdisciplinary work, we take a complementary "structural" perspective and shift our focus to how LLMs are trained. We operationalize a notion of "fair learning" by measuring whether any training decision substantially affected the model's memorization. As a case study, we deconstruct Pythia, an open-source LLM, and demonstrate the use of causal and correlational analyses to make factual determinations about Pythia's training decisions. By proposing a legal standard for fair learning and connecting memorization analyses to this standard, we identify how judges may advance the goals of copyright law through adjudication. Finally, we discuss how a fair learning standard might evolve to enhance its clarity by becoming more rule-like and incorporating external technical guidelines.

large language model, machine learning, natural language

Country:
- Asia (1.00)
- Europe (1.00)
- North America > United States
  - California > Los Angeles County > Los Angeles (0.14)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Government > Regional Government
  - North America Government > United States Government (0.92)
- Information Technology (1.00)
- Law
  - Intellectual Property & Technology Law (1.00)
  - Litigation (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (1.00)