Nepotistically Trained Generative-AI Models Collapse

Bohacek, Matyas, Farid, Hany

arXiv.org Artificial Intelligence 

From text to audio and images, today's generative-AI systems are trained on large quantities of human-generated content, most of it obtained by scraping a variety of online sources. As generative AI becomes more common, it is reasonable to expect that future data scraping will invariably capture generative AI's own creations. We ask what happens when these generative systems are trained on varying combinations of human-generated and AI-generated content. Although it is early in the evolution of generative AI, there is already some evidence that retraining a generative-AI model on its own creations - what we call model poisoning - leads to a range of artifacts in the output of the newly trained model. It has been shown, for example, that when retrained on their own output, large language models (LLMs) develop irreversible defects that cause them to produce gibberish - so-called model collapse [22].
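The collapse phenomenon described above can be illustrated with a toy simulation (this is an illustrative sketch, not the paper's experiment): a simple Gaussian "model" is fit to data, new data are sampled from the fitted model, the model is refit on those samples, and the cycle repeats. Because the maximum-likelihood variance estimate is biased low and each generation compounds sampling error, the learned distribution degenerates over successive generations.

```python
import random
import statistics

def fit_gaussian(samples):
    """Fit a 1-D Gaussian by maximum likelihood (biased variance)."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # MLE std, divides by n
    return mu, sigma

random.seed(0)
mu, sigma = 0.0, 1.0   # the "true" human-data distribution
n = 50                 # samples per generation (small to exaggerate the effect)
variances = []

# Each generation trains only on the previous generation's synthetic output.
for generation in range(200):
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    mu, sigma = fit_gaussian(samples)
    variances.append(sigma ** 2)

print(f"initial learned variance: {variances[0]:.3f}")
print(f"final learned variance:   {variances[-1]:.3f}")
```

After many generations the learned variance shrinks toward zero, so the model loses the diversity of the original distribution; this mirrors, in miniature, the degradation that recursive training on synthetic data produces in large generative models.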