Meta CLIP 2: A Worldwide Scaling Recipe
Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu
arXiv.org Artificial Intelligence
Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting tasks from zero-shot classification and retrieval to serving as the vision encoder for multimodal large language models (MLLMs). Although CLIP has been successfully trained on billion-scale image-text pairs from the English-speaking world, scaling its training further to learn from worldwide web data remains challenging: (1) no curation method exists for handling data points from the non-English world; (2) the English performance of existing multilingual CLIP models is worse than that of their English-only counterparts, i.e., the "curse of multilinguality" that is also common in LLMs. Here, we present Meta CLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with the minimal changes necessary to address the above challenges, and we present a recipe that enables mutual benefits between English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and, surprisingly, sets a new state of the art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, reaching 57.4% on CVQA, 50.2% on Babel-ImageNet, and 64.3% on XM3600 image-to-text retrieval.
Aug-4-2025
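For readers unfamiliar with the zero-shot classification setup the abstract evaluates, the sketch below illustrates the general CLIP mechanism: embed the image and one text prompt per class into a shared space, then take a softmax over scaled cosine similarities. The encoders here are random stand-ins, and the embedding size, class count, and 100.0 logit scale are illustrative assumptions; this is not the Meta CLIP 2 API.

```python
# Minimal sketch of CLIP-style zero-shot classification (illustrative only).
# The "towers" below are random stand-ins for trained CLIP encoders;
# EMB_DIM and the number of classes are assumed values, not Meta CLIP 2's.
import torch
import torch.nn.functional as F

EMB_DIM = 512
NUM_CLASSES = 10
image_encoder = torch.nn.Linear(3 * 224 * 224, EMB_DIM)      # stand-in vision tower
text_encoder = torch.nn.Embedding(NUM_CLASSES, EMB_DIM)      # stand-in text tower

def zero_shot_classify(images: torch.Tensor, class_prompt_ids: torch.Tensor) -> torch.Tensor:
    """Return one probability per class by matching each image embedding
    against one text embedding per class prompt (e.g., 'a photo of a dog')."""
    img_emb = F.normalize(image_encoder(images.flatten(1)), dim=-1)  # (B, D)
    txt_emb = F.normalize(text_encoder(class_prompt_ids), dim=-1)    # (C, D)
    logits = 100.0 * img_emb @ txt_emb.t()  # scaled cosine similarities, (B, C)
    return logits.softmax(dim=-1)           # probabilities over the C classes

probs = zero_shot_classify(torch.randn(1, 3, 224, 224), torch.arange(NUM_CLASSES))
print(probs.argmax(dim=-1))  # predicted class index
```

In practice the stand-in towers would be replaced by a trained model's image and text encoders, and each class id by a tokenized natural-language prompt; the 100.0 factor mirrors the learned temperature (logit scale) that CLIP clamps at 100 during training.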