Data Curation Matters: Model Collapse and Spurious Shift Performance Prediction from Training on Uncurated Text Embeddings
Mattioli, Lucas, Hadichou, Youness Ait, Chaouche, Sabrina, Gonzalez, Martin
arXiv.org Artificial Intelligence
Training models on uncurated Text Embeddings (TEs) derived from raw tabular data can lead to a severe failure mode known as model collapse, where predictions converge to a single class regardless of input. By comparing models trained with identical hyper-parameter configurations on both raw tabular data and their TE-derived counterparts, we find that collapse is a consistent failure mode in the latter setting. We introduce a set of metrics that capture the extent of model collapse, offering a new perspective on TE quality as a proxy for data curation. Our results reveal that TEs alone do not effectively function as a curation layer and that their quality significantly influences downstream learning. More insidiously, we observe that the presence of model collapse can yield artificially inflated and spurious Accuracy-on-the-Line correlations. These findings highlight the need for more nuanced curation and evaluation of embedding-based representations, particularly in out-of-distribution settings.
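The abstract does not spell out the collapse metrics, so the following is only an illustrative sketch of how "predictions converge to a single class" could be quantified. The function name `collapse_metrics` and the two statistics it reports (share of the most-predicted class and normalized entropy of the predicted-label distribution) are assumptions made for illustration, not the authors' definitions.

```python
import math
from collections import Counter


def collapse_metrics(predictions, num_classes):
    """Hypothetical collapse indicators for a list of predicted class labels.

    These are illustrative statistics, not the metrics defined in the paper.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")

    n = len(predictions)
    counts = Counter(predictions)

    # Share of predictions falling in the single most-predicted class.
    # A value of 1.0 means every input received the same label.
    majority_share = max(counts.values()) / n

    # Shannon entropy of the predicted-label distribution, scaled by the
    # maximum possible entropy log(num_classes). Values near 0 suggest collapse.
    entropy = max(0.0, -sum((c / n) * math.log(c / n) for c in counts.values()))
    max_entropy = math.log(num_classes)
    normalized_entropy = entropy / max_entropy if max_entropy > 0 else 0.0

    return {
        "majority_share": majority_share,
        "normalized_entropy": normalized_entropy,
        "classes_predicted": len(counts),
    }


# A fully collapsed binary classifier versus a balanced one.
collapsed = collapse_metrics([1] * 10, num_classes=2)
balanced = collapse_metrics([0, 1, 0, 1], num_classes=2)
```

Under these assumptions, a collapsed model shows a majority share close to 1 and a normalized entropy close to 0, whatever accuracy it reports. That is one way to see why accuracy alone can mask the failure mode.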
Jun-24-2025
- Country:
  - Europe > France (0.14)
  - North America
    - Canada (0.04)
    - Puerto Rico (0.04)
    - United States
      - Alabama (0.04)
      - Alaska (0.04)
      - Arizona (0.04)
      - Arkansas (0.04)
      - California (0.04)
- Genre:
  - Research Report > New Finding (0.88)
- Technology: