Understanding the Gain from Data Filtering in Multimodal Contrastive Learning
Neural Information Processing Systems
The success of modern multimodal representation learning relies on internet-scale datasets. Because a large fraction of raw web data is of low quality, data curation has become a critical step in the training pipeline. Filtering with a trained model (i.e., teacher-based filtering) has emerged as a successful solution: a pre-trained model is used to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting by η ∈ (0,1] the fraction of data with correctly matched modalities among n paired samples, we use a linear contrastive learning setup to show a provable benefit of data filtering: (i) the error without filtering is upper and lower bounded by 1/(η√n), and (ii) the error with teacher-based filtering is upper bounded by 1/√(ηn) in the large η regime, and by 1/√n in the small η regime.
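To make the comparison between the stated rates concrete, the following is a minimal sketch that evaluates them numerically. It assumes the bounds are read as 1/(η√n) without filtering, and 1/√(ηn) or 1/√n with teacher-based filtering; constants and any logarithmic factors are ignored, and the function names are hypothetical. The abstract does not say where the boundary between the large η and small η regimes lies, so the sketch reports both filtered rates rather than picking one.

```python
import math


def unfiltered_rate(eta: float, n: int) -> float:
    """Rate 1/(eta * sqrt(n)) for training without filtering (constants dropped)."""
    return 1.0 / (eta * math.sqrt(n))


def filtered_rate_large_eta(eta: float, n: int) -> float:
    """Rate 1/sqrt(eta * n) for teacher-based filtering, large-eta regime."""
    return 1.0 / math.sqrt(eta * n)


def filtered_rate_small_eta(n: int) -> float:
    """Rate 1/sqrt(n) for teacher-based filtering, small-eta regime."""
    return 1.0 / math.sqrt(n)


def gain_large_eta(eta: float, n: int) -> float:
    """Ratio of unfiltered to filtered rate; algebraically 1/sqrt(eta)."""
    return unfiltered_rate(eta, n) / filtered_rate_large_eta(eta, n)


if __name__ == "__main__":
    eta, n = 0.25, 10_000  # illustrative values, not taken from the paper
    print("unfiltered:          ", unfiltered_rate(eta, n))
    print("filtered (large eta):", filtered_rate_large_eta(eta, n))
    print("filtered (small eta):", filtered_rate_small_eta(n))
    print("gain (large eta):    ", gain_large_eta(eta, n))
```

Under this reading, the improvement in the large η regime is a factor of 1/√η, independent of n: with a quarter of the pairs correctly matched, the filtered rate is half the unfiltered one.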
Jun-14-2026, 08:56:08 GMT