ForTIFAI: Fending Off Recursive Training Induced Failure for AI Model Collapse
Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Azalia Mirhoseini, Farinaz Koushanfar
–arXiv.org Artificial Intelligence
The increasing reliance on generative AI models is rapidly increasing the volume of synthetic data, with some projections suggesting that most new data available for training could be machine-generated by 2030 (Gartner, Inc., 2022). This shift toward predominantly synthetic content presents a critical challenge: repeated training on synthetic data leads to a phenomenon known as model collapse, where model performance degrades over generations of training, eventually rendering the models ineffective. While the causes of model collapse are increasingly understood, effective mitigation strategies remain scarce. We address this challenge by leveraging a key insight: auto-regressive models tend to generate text sequences to which they assign high confidence (i.e., high log-likelihood). Based on this observation, we introduce the Truncated-Cross-Entropy (TCE) loss function. Our experiments demonstrate that models trained with TCE not only learn effectively but also exhibit significantly increased resilience, tolerating over 2.3x more synthetic data before the onset of collapse. In addition, we provide an open-source benchmark for collapse dynamics in mixed-data settings. Our results demonstrate that confidence-aware training objectives can substantially delay collapse onset, offering a practical and generalizable tool for model robustness under synthetic-data exposure.

Generative models have become the foundation for modern AI applications across several modalities, including text, image, code, and audio. Large Language Models (LLMs) such as ChatGPT (OpenAI et al., 2024), LLaMA (Grattafiori et al., 2024), and Gemma (Team et al., 2025), as well as image generators such as DALL-E (Ramesh et al., 2021) and Imagen (Saharia et al., 2022), all rely on large datasets scraped from the Web. As these models are continuously updated to reflect recent knowledge and linguistic patterns, the need for ever larger and frequently refreshed training corpora has grown substantially.
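The abstract describes TCE only at a high level: since auto-regressive models assign high likelihood to their own generations, a confidence-aware loss can down-weight exactly those tokens. A minimal sketch of one plausible realization is shown below, where per-token cross-entropy values falling under a threshold `tau` (i.e., tokens the model is already highly confident on) are truncated out of the average. The function name, the threshold `tau`, and the masking rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def truncated_cross_entropy(logits, targets, tau=0.1):
    """Illustrative TCE-style loss (hypothetical implementation).

    logits:  array of shape (batch, seq_len, vocab)
    targets: integer array of shape (batch, seq_len)
    tau:     confidence threshold; tokens with per-token CE below tau
             (high model confidence) are excluded from the loss.
    """
    # Numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    # Per-token negative log-likelihood of the target token
    ce = -np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)

    # Truncate: keep only tokens the model is NOT yet confident on
    mask = ce >= tau
    return ce[mask].mean() if mask.any() else 0.0
```

Intuitively, synthetic tokens already memorized from a previous model generation contribute near-zero cross-entropy and would otherwise dominate gradient updates toward the model's own distribution; truncating them focuses learning on less predictable (more likely human-authored) content.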
However, this demand is colliding with a shift in the data landscape: synthetic content is increasingly populating the Internet, contaminating the very datasets used for model training. This shift raises fundamental concerns.
Nov-6-2025