Economics of Sourcing Human Data

Sebastin Santy, Prasanta Bhattacharya, Manoel Horta Ribeiro, Kelsey Allen, Sewoong Oh

Feb-11-2025, arXiv.org Artificial Intelligence

Progress in AI has relied on human-generated data, from annotator marketplaces to the wider Internet. However, the widespread use of large language models now threatens the quality and integrity of human-generated data on these very platforms. We argue that this issue goes beyond the immediate challenge of filtering AI-generated content: it reveals deeper flaws in how data collection systems are designed. Existing systems often prioritize speed, scale, and efficiency at the cost of intrinsic human motivation, leading to declining engagement and data quality. We propose that rethinking data collection systems to align with contributors' intrinsic motivations, rather than relying solely on external incentives, can help sustain high-quality data sourcing at scale while maintaining contributor trust and long-term participation.

large language model, machine learning, motivation


Country:
- North America > United States > California (0.14)

Genre:
- Research Report (0.50)

Industry:
- Automobiles & Trucks (0.67)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
- Information Technology (1.00)
- Leisure & Entertainment > Games
  - Computer Games (1.00)
- Transportation (0.93)

Technology:
- Information Technology
  - Artificial Intelligence
    - Cognitive Science (0.93)
    - Machine Learning > Neural Networks
      - Deep Learning (0.94)
    - Natural Language > Large Language Model (0.90)
    - Representation & Reasoning (1.00)
    - Robots (1.00)
  - Communications > Social Media (1.00)
