Data curation via joint example selection further accelerates multimodal learning

Olivier J. Hénaff

May-25-2025, 21:47:29 GMT–Neural Information Processing Systems

Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly prioritizing batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points. As performance improves by selecting from large super-batches, we also leverage recent advances in model approximation to reduce the computational overhead of scoring.
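To make the idea of scoring a batch jointly more concrete, the following is a minimal sketch and not the algorithm derived in the paper. It assumes that the joint learnability of a sub-batch can be measured as the contrastive loss under the current learner minus the loss under a reference model, that the contrastive loss is a symmetric softmax loss, and that a plain greedy search is an acceptable stand-in for the paper's selection procedure. All function names (`contrastive_loss`, `joint_learnability`, `greedy_select`) and the toy similarity matrices are hypothetical.

```python
import math


def _logsumexp(values):
    # Numerically stable log(sum(exp(v))) for a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def contrastive_loss(sim, idx):
    """Mean symmetric softmax contrastive loss over the sub-batch `idx`.

    sim[i][j] is the similarity of image i and text j in the super-batch;
    matched pairs lie on the diagonal. Because each example is contrasted
    against the others in `idx`, the loss depends on the batch as a whole.
    """
    total = 0.0
    for i in idx:
        row = [sim[i][j] for j in idx]  # image i against the texts in the batch
        col = [sim[j][i] for j in idx]  # text i against the images in the batch
        total += 0.5 * ((_logsumexp(row) - sim[i][i]) + (_logsumexp(col) - sim[i][i]))
    return total / len(idx)


def joint_learnability(sim_learner, sim_ref, idx):
    # Assumed criterion: summed learner loss minus summed reference loss.
    gap = contrastive_loss(sim_learner, idx) - contrastive_loss(sim_ref, idx)
    return len(idx) * gap


def greedy_select(sim_learner, sim_ref, k):
    """Greedily grow a sub-batch of size k from the super-batch.

    At each step, add the candidate that maximizes the joint score of the
    batch chosen so far. Ties are broken by the lowest index.
    """
    chosen = []
    remaining = list(range(len(sim_learner)))
    while len(chosen) < k and remaining:
        best = max(
            remaining,
            key=lambda c: joint_learnability(sim_learner, sim_ref, chosen + [c]),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy 4-example super-batch: the learner confuses examples 2 and 3,
# while the reference model separates all pairs well.
sim_learner = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
sim_ref = [[5.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
selected = greedy_select(sim_learner, sim_ref, k=2)
print(selected)
```

A per-example score could not capture the interaction in this toy case: examples 2 and 3 are only hard for the learner when they appear in the same batch, which is the kind of dependency a batch-level criterion is meant to expose.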

large language model, machine learning, natural language

Country:
- Europe > Netherlands (0.14)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning
      - Inductive Learning (0.46)
      - Neural Networks > Deep Learning (0.46)
    - Natural Language > Large Language Model (0.46)
  - Data Science > Data Quality
    - Data Cleaning (0.71)