Cluster and Predict Latent Patches for Improved Masked Image Modeling

Darcet, Timothée, Baldassarre, Federico, Oquab, Maxime, Mairal, Julien, Bojanowski, Piotr

Feb-17-2025–arXiv.org Artificial Intelligence

Masked Image Modeling (MIM) offers a promising approach to self-supervised representation learning; however, existing MIM models still lag behind the state of the art. In this paper, we systematically analyze target representations, loss functions, and architectures to introduce CAPI, a novel pure-MIM framework that relies on the prediction of latent clusterings. Our approach leverages a clustering-based loss, which is stable to train and exhibits promising scaling properties. Our ViT-L backbone, CAPI, achieves 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes, substantially outperforming previous MIM methods and approaching the performance of the current state of the art, DINOv2. We release all our code and models.
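
To make "prediction of latent clusterings" concrete, here is a minimal sketch of one way such a loss could look. It is not the released CAPI code. The abstract does not specify the details, so the teacher/student split, the fixed centroids, the dot-product similarity, and the temperature value are all assumptions made for illustration. The idea shown: latents of masked patches are softly assigned to clusters, and a predictor is trained with cross-entropy to match those assignments.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cluster_targets(latent, centroids, temperature=0.1):
    """Soft cluster assignment of one patch latent.

    Assumption: similarity is a plain dot product against fixed,
    hypothetical centroids; the paper may cluster differently.
    """
    return softmax([dot(latent, c) for c in centroids], temperature)


def masked_cluster_loss(student_logits, teacher_latents, centroids, mask):
    """Mean cross-entropy between predicted and target cluster
    assignments, computed on masked patches only."""
    losses = []
    for logits, latent, is_masked in zip(student_logits, teacher_latents, mask):
        if not is_masked:
            continue  # visible patches contribute no loss in this sketch
        target = cluster_targets(latent, centroids)
        pred = softmax(logits)
        losses.append(-sum(t * math.log(p + 1e-12) for t, p in zip(target, pred)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: two patches, two clusters, only the first patch is masked.
centroids = [[1.0, 0.0], [0.0, 1.0]]
teacher_latents = [[1.0, 0.0], [0.0, 1.0]]
mask = [True, False]

loss_good = masked_cluster_loss([[5.0, -5.0], [0.0, 0.0]], teacher_latents, centroids, mask)
loss_bad = masked_cluster_loss([[-5.0, 5.0], [0.0, 0.0]], teacher_latents, centroids, mask)
```

In this toy setup, a prediction that agrees with the target cluster for the masked patch yields a lower loss than one that picks the wrong cluster, and patches that are not masked are ignored.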
