ForTIFAI: Fending Off Recursive Training Induced Failure for AI Model Collapse
Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Azalia Mirhoseini, Farinaz Koushanfar
–arXiv.org Artificial Intelligence
The increasing reliance on generative AI models is rapidly increasing the volume of synthetic data, with some projections suggesting that most new data available for training could be machine-generated by 2030 (Gartner, Inc., 2022). This shift toward predominantly synthetic content presents a critical challenge: repeated training on synthetic data leads to a phenomenon known as model collapse, where model performance degrades over generations of training, eventually rendering the models ineffective. While the causes of model collapse are increasingly understood, effective mitigation strategies remain scarce. We address this challenge by leveraging a key insight: auto-regressive models tend to generate text sequences to which they assign high confidence (i.e., high log-likelihood). Based on this observation, we introduce the Truncated-Cross-Entropy (TCE) loss function. Our experiments demonstrate that models trained with TCE not only learn effectively but also exhibit significantly increased resilience, tolerating over 2.3x more synthetic data before the onset of collapse. In addition, we provide an open-source benchmark for collapse dynamics in mixed-data settings. Our results demonstrate that confidence-aware training objectives can substantially delay collapse onset, offering a practical and generalizable tool for model robustness under synthetic-data exposure.

Generative models have become the foundation for modern AI applications across several modalities, including text, image, code, and audio. Large Language Models (LLMs) such as ChatGPT (OpenAI et al., 2024), LLaMA (Grattafiori et al., 2024), and Gemma (Team et al., 2025), as well as image generators such as DALL-E (Ramesh et al., 2021) and Imagen (Saharia et al., 2022), all rely on large datasets scraped from the Web. As these models are continuously updated to reflect recent knowledge and linguistic patterns, the need for ever larger and frequently refreshed training corpora has grown substantially.
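The abstract describes TCE only at a high level: since auto-regressive models assign high likelihood to their own generations, a confidence-aware loss can down-weight exactly those tokens. A minimal sketch of one plausible realization is shown below, where per-token cross-entropy values falling under a threshold `tau` (i.e., tokens the model is already highly confident on) are truncated out of the average. The function name, the threshold `tau`, and the masking rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def truncated_cross_entropy(logits, targets, tau=0.1):
    """Illustrative TCE-style loss (hypothetical implementation).

    logits:  array of shape (batch, seq_len, vocab)
    targets: integer array of shape (batch, seq_len)
    tau:     confidence threshold; tokens with per-token CE below tau
             (high model confidence) are excluded from the loss.
    """
    # Numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    # Per-token negative log-likelihood of the target token
    ce = -np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)

    # Truncate: keep only tokens the model is NOT yet confident on
    mask = ce >= tau
    return ce[mask].mean() if mask.any() else 0.0
```

Intuitively, synthetic tokens already memorized from a previous model generation contribute near-zero cross-entropy and would otherwise dominate gradient updates toward the model's own distribution; truncating them focuses learning on less predictable (more likely human-authored) content.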
However, this demand is colliding with a shift in the data landscape: synthetic content is increasingly populating the Internet, contaminating the very datasets used for model training. This shift raises fundamental concerns.
Nov-6-2025