Data curation via joint example selection further accelerates multimodal learning
–Neural Information Processing Systems
Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly prioritizing batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points. As performance improves by selecting from large super-batches, we also leverage recent advances in model approximation to reduce the computational overhead of scoring.
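To make the dependency between examples concrete, the following is a minimal sketch rather than the authors' algorithm. It assumes a softmax image-to-text contrastive loss and uses the raw batch loss as a stand-in for the joint learnability criterion, which the abstract does not spell out. The names `contrastive_losses`, `batch_score`, and `greedy_select` are hypothetical and introduced only for illustration.

```python
import math

def contrastive_losses(img, txt, temperature=0.1):
    """Per-example image-to-text softmax contrastive loss for one batch.

    img[i] and txt[i] are paired embeddings given as lists of floats.
    Each example's loss depends on every other example in the batch,
    because the other texts act as its negatives.
    """
    losses = []
    for i, u in enumerate(img):
        logits = [sum(a * b for a, b in zip(u, v)) / temperature for v in txt]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return losses

def batch_score(indices, img, txt):
    """Assumed joint score: total contrastive loss of the chosen sub-batch."""
    sub_img = [img[i] for i in indices]
    sub_txt = [txt[i] for i in indices]
    return sum(contrastive_losses(sub_img, sub_txt))

def greedy_select(img, txt, k):
    """Greedily grow a sub-batch of size k from a super-batch.

    At each step, add the candidate that most increases the joint score
    of the sub-batch selected so far (an illustrative heuristic only).
    """
    chosen = []
    remaining = list(range(len(img)))
    while len(chosen) < k:
        best = max(remaining, key=lambda j: batch_score(chosen + [j], img, txt))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy super-batch: pairs 0 and 2 are identical, pair 1 is orthogonal to both.
img = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
txt = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
selected = greedy_select(img, txt, k=2)
```

In this toy super-batch, the loss of example 0 is much higher when it shares a batch with its near-duplicate (example 2) than with the orthogonal example 1, so a score computed per example in isolation cannot capture what a joint score does.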
Dec-27-2025, 14:39:05 GMT