Vision Learners Meet Web Image-Text Pairs
Zhao, Bingchen, Cui, Quan, Wu, Hao, Yoshie, Osamu, Yang, Cheng, Mac Aodha, Oisin
arXiv.org Artificial Intelligence
Most recent self-supervised learning methods are pre-trained on the well-curated ImageNet-1K dataset. In this work, given the excellent scalability of web data, we consider self-supervised pre-training on noisy, web-sourced image-text paired data. First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting. We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training. We observe that existing multi-modal methods do not outperform their single-modal counterparts on vision transfer learning tasks. We derive an information-theoretic view to explain these benchmark results, which provides insight into how to design a novel vision learner. Inspired by this insight, we present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web-sourced image-text data. MUG achieves state-of-the-art transfer performance on a variety of tasks and demonstrates promising scaling properties. Pre-trained models and code will be made public upon acceptance.
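For context on the multi-modal baselines the abstract benchmarks, below is a minimal sketch of a CLIP-style image-text contrastive (symmetric InfoNCE) objective. All names and the temperature value here are illustrative assumptions, not the authors' code, and MUG itself departs from this contrastive formulation in favor of a generative objective.

```python
# Sketch of image-text contrastive training (InfoNCE), assuming paired
# image/text embeddings where matching pairs share the same row index.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of (image, text) pairs.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are positives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

With noisy web captions, the diagonal "positive pair" assumption baked into this loss is exactly what the paper's benchmark probes: misaligned captions weaken the contrastive signal, which is one reason a generative learner like MUG may be better suited to web-scale data.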
Apr-5-2023