Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Pareek, Divyansh, Oh, Sewoong, Du, Simon S.

Dec-17-2025, arXiv.org Machine Learning

The success of modern multimodal representation learning relies on internet-scale datasets. Because a large fraction of raw web data is of low quality, data curation has become a critical step in the training pipeline. Teacher-based filtering, which uses a pre-trained model to compute quality scores, has emerged as a successful solution. To explain its empirical success, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Letting $\eta\in(0,1]$ denote the fraction of data with correctly matched modalities among $n$ paired samples, we use a linear contrastive learning setup to show a provable benefit of data filtering: $(i)$ the error without filtering is upper and lower bounded by $\frac{1}{\eta\sqrt{n}}$, and $(ii)$ the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.
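
To make the size of the gain concrete, the stated rates can be compared directly. This is only a rough sketch: it treats the bounds as exact rates and ignores constants and any logarithmic factors, which the abstract does not give, and it makes no claim about where the large-$\eta$ and small-$\eta$ regimes meet. Because the unfiltered error is also lower bounded by $\frac{1}{\eta\sqrt{n}}$, the ratio of unfiltered to filtered error is at least of order

$$
\frac{1/(\eta\sqrt{n})}{1/\sqrt{\eta n}} = \frac{1}{\sqrt{\eta}} \quad (\text{large } \eta),
\qquad
\frac{1/(\eta\sqrt{n})}{1/\sqrt{n}} = \frac{1}{\eta} \quad (\text{small } \eta).
$$

Under these assumptions, a dataset with $\eta = 0.25$ that falls in the large $\eta$ regime would see its error rate improve by a factor of about $1/\sqrt{0.25} = 2$. In that regime the filtered rate $\frac{1}{\sqrt{\eta n}}$ has the form $\frac{1}{\sqrt{m}}$ with $m = \eta n$, the number of correctly matched pairs.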



