Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Jun-14-2026, 08:56:08 GMT–Neural Information Processing Systems

The success of modern multimodal representation learning relies on internet-scale datasets. Because a large fraction of raw web data is of low quality, data curation has become a critical step in the training pipeline. Filtering with a trained model (i.e., teacher-based filtering) has emerged as a successful solution: a pre-trained model computes a quality score for each sample. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting by η ∈ (0,1] the fraction of data with correctly matched modalities among n paired samples, we use a linear contrastive learning setup to show a provable benefit of data filtering: (i) the error without filtering is upper and lower bounded by 1/(η√n), and (ii) the error with teacher-based filtering is upper bounded by 1/√(ηn) in the large η regime, and by 1/√n in the small η regime.
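To make the idea of teacher-based filtering concrete, here is a minimal Python sketch. It is not the paper's construction: the abstract does not spell out the data model, so the sketch assumes a shared-latent linear model in which a matched pair (x, y) is generated from the same latent vector and a mismatched pair from independent latents. The function names (`make_pairs`, `teacher_scores`, `filter_pairs`), the dimensions, the noise level, and the idealized teacher built from pseudo-inverses of the generating maps are all hypothetical choices made for illustration.

```python
import numpy as np

def make_pairs(n, eta, d=32, r=4, noise=0.1, seed=0):
    """Assumed toy bimodal model: x = A z + noise, y = B z' + noise.

    For a matched pair z' = z; for a mismatched pair z' is an independent draw.
    Each pair is matched with probability eta.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, r)) / np.sqrt(d)
    B = rng.standard_normal((d, r)) / np.sqrt(d)
    z = rng.standard_normal((n, r))
    matched = rng.random(n) < eta
    z_y = np.where(matched[:, None], z, rng.standard_normal((n, r)))
    x = z @ A.T + noise * rng.standard_normal((n, d))
    y = z_y @ B.T + noise * rng.standard_normal((n, d))
    return x, y, matched, A, B

def teacher_scores(x, y, G_x, G_y):
    """Quality score of each pair: cosine similarity of teacher embeddings."""
    u = x @ G_x.T
    v = y @ G_y.T
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return np.sum(u * v, axis=1)

def filter_pairs(scores, keep_fraction):
    """Boolean mask that keeps the top `keep_fraction` of pairs by score."""
    k = max(1, int(round(keep_fraction * len(scores))))
    keep = np.zeros(len(scores), dtype=bool)
    keep[np.argsort(scores)[-k:]] = True
    return keep

if __name__ == "__main__":
    x, y, matched, A, B = make_pairs(n=2000, eta=0.5)
    # Idealized teacher (an assumption): linear encoders that invert A and B.
    scores = teacher_scores(x, y, np.linalg.pinv(A), np.linalg.pinv(B))
    keep = filter_pairs(scores, keep_fraction=0.4)
    print("matched fraction before filtering:", matched.mean())
    print("matched fraction after filtering: ", matched[keep].mean())
```

In this toy setting the filtered subset should contain a much higher share of matched pairs than the raw data, which is the effect the bounds above quantify. A real teacher would be a pre-trained model rather than an oracle inverse, and the choice of `keep_fraction` would depend on η.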

artificial intelligence, data quality, machine learning

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Education (1.00)

Technology:
- Information Technology
  - Data Science > Data Quality
    - Data Cleaning (0.34)
  - Artificial Intelligence > Machine Learning
    - Neural Networks (0.45)
