AI Teams Contend With Synthetic Data's Jekyll/Hyde Roles

Communications of the ACM 

Training models with synthetic data presents both a danger and a boon to artificial intelligence (AI). While some groups have aggressively pursued the use of model-generated data to train successor models for greater accuracy and generalization, others have warned about the risks posed by AI ingesting its own output. The two views are not at odds; the question is when and where things go wrong. On the negative side, a flurry of papers published since 2021 has argued that, as the datasets used to pretrain foundation models incorporate more and more auto-generated data mined from the Internet, performance degrades and the models start to "unlearn" skills.