Data curation via joint example selection further accelerates multimodal learning

Dec-27-2025, 14:39:05 GMT–Neural Information Processing Systems

Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly prioritizing batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which accelerates training significantly beyond what individually prioritized data points achieve. Because performance improves when selecting from large super-batches, we also leverage recent advances in model approximation to reduce the computational overhead of scoring.
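
The abstract does not spell out the selection algorithm, so the following is only a minimal sketch of what joint batch selection from a super-batch could look like, not the authors' published method. It rests on several assumptions that are not in the text above: a sigmoid contrastive loss that decomposes over image-text pairs, a "learnability" score defined as learner loss minus the loss of a separate reference model, and chunk-wise sampling conditioned on the examples already chosen. The names `sigmoid_pair_losses` and `select_jointly` are hypothetical, and `batch_size` is assumed to be divisible by `n_chunks`.

```python
import numpy as np

def sigmoid_pair_losses(img, txt):
    """Pairwise sigmoid-contrastive losses; entry (i, j) pairs image i with text j."""
    logits = img @ txt.T
    labels = 2.0 * np.eye(len(img)) - 1.0        # +1 for matching pairs, -1 otherwise
    return np.logaddexp(0.0, -labels * logits)   # softplus(-label * logit)

def select_jointly(learner_loss, reference_loss, batch_size, n_chunks, rng):
    """Sample `batch_size` indices from a super-batch, one chunk at a time."""
    # Assumed learnability score: high when the learner still struggles
    # on a pair that the reference model finds easy.
    score = learner_loss - reference_loss
    n = score.shape[0]
    chunk = batch_size // n_chunks
    selected = np.array([], dtype=int)
    for _ in range(n_chunks):
        # Conditional score of each candidate: its own matching pair plus
        # its cross pairs with everything already in the batch.
        cond = np.diag(score).copy()
        if selected.size:
            cond += score[:, selected].sum(axis=1) + score[selected, :].sum(axis=0)
        cond[selected] = -np.inf                 # never pick an example twice
        probs = np.exp(cond - cond.max())
        probs /= probs.sum()
        new = rng.choice(n, size=chunk, replace=False, p=probs)
        selected = np.concatenate([selected, new])
    return selected

def unit(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy super-batch of 64 examples with random stand-in embeddings.
rng = np.random.default_rng(0)
learner = sigmoid_pair_losses(unit(rng.normal(size=(64, 16))), unit(rng.normal(size=(64, 16))))
reference = sigmoid_pair_losses(unit(rng.normal(size=(64, 16))), unit(rng.normal(size=(64, 16))))
idx = select_jointly(learner, reference, batch_size=16, n_chunks=4, rng=rng)
```

Because the conditional score includes cross terms with the examples already selected, later chunks depend on earlier ones. That dependence is what distinguishes this sketch from scoring each example independently, where only the diagonal term would be used.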

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.58)