Learning from Synthetic Data: Limitations of ERM
Kareem Amin, Alex Bie, Weiwei Kong, Umar Syed, Sergei Vassilvitskii
The first generation of LLMs was largely trained on human-generated data. However, the success of LLMs and their increased adoption have had an unexpected consequence: AI-generated content now appears in places where there was previously none. Machine learning practitioners should therefore be aware that there is an increased chance their training data is contaminated by LLM-generated content. Previous work has examined the value of synthetic (i.e., AI-generated) data and showed that while naively adding this data to the training mix may lead to model collapse, being more diligent about which data is added, the amount of curation it undergoes, and the specifics of the training process may mitigate that risk, or even reverse it, leading to improved performance. These works focus almost exclusively on the LLM setting, aiming to improve state-of-the-art performance on a set of benchmarks. In contrast, in this work we take a traditional learning theory view of the problem. We begin by formalizing the setting and developing a framework that captures the invariants of having natural training data contaminated by synthetic additions. Specifically, we identify three salient points. Groundtruth: there exists a (potentially small) set of natural data, coming from the true data-generating distribution.
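To make the contamination setting concrete, here is a minimal illustrative sketch (not from the paper) of ERM over a training mix of natural and synthetic data. For squared loss over constant predictors, ERM reduces to the empirical mean; the distribution shift of the synthetic data and all sample sizes below are hypothetical choices for illustration only.

```python
import random

def erm_mean(samples):
    """ERM for squared loss over constant predictors: the empirical mean."""
    return sum(samples) / len(samples)

random.seed(0)

# Groundtruth: a (potentially small) natural sample from the true
# data-generating distribution (here: Gaussian with mean 0).
natural = [random.gauss(0.0, 1.0) for _ in range(50)]

# Synthetic contamination: a larger generated sample whose distribution
# has drifted from the truth (here: mean shifted to 2).
synthetic = [random.gauss(2.0, 1.0) for _ in range(500)]

theta_natural = erm_mean(natural)            # close to the true mean 0
theta_mixed = erm_mean(natural + synthetic)  # pulled toward the synthetic mean

print(theta_natural, theta_mixed)
```

Running the sketch shows the mixed-data ERM solution dragged toward the synthetic distribution, which is the failure mode that naive use of synthetic data can induce.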
Jan-23-2026