AI Teams Contend With Synthetic Data's Jekyll/Hyde Roles

Aug-26-2025, 16:03:13 GMT – Communications of the ACM

Training models on synthetic data is both a danger and a boon for artificial intelligence (AI). Some groups have aggressively pursued model-generated data to train successor models with greater accuracy and generalization, while others have warned about the risks of AI ingesting its own output. The two views are not at odds; the question is when and where things go wrong. On the negative side, a flurry of papers published since 2021 has argued that, as the datasets used to pretrain foundation models incorporate more and more auto-generated data mined from the Internet, performance degrades and the models start to "unlearn" skills.

large language model, machine learning, natural language

Country:
- Europe > Germany
  - Brandenburg > Potsdam (0.05)
- Asia
  - China (0.05)
  - Japan > Honshū
    - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)

Genre:
- Research Report (0.35)

Industry:
- Education (0.70)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.49)
  - Machine Learning > Neural Networks
    - Deep Learning (0.50)