Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Dec-24-2025, 02:49:16 GMT–Neural Information Processing Systems

Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method requires neither bounding box annotations nor high-resolution images.
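To make the "align before fusing" idea concrete, the following is a minimal sketch of a symmetric image-text contrastive loss in pure Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract gives no formula, so the InfoNCE-style formulation, the function names (`l2_normalize`, `cross_entropy_row`, `contrastive_loss`), and the default `temperature` value are all hypothetical choices made here, and momentum distillation is not modeled.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Softmax cross-entropy for one row of logits, using a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumes image_embs[k] and text_embs[k] form a matching pair and that
    every other pairing in the batch is a negative. The temperature default
    is an assumed value for illustration.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text: each image should rank its own text highest.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image: each text should rank its own image highest.
    t2i = sum(
        cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (i2t + t2i)


# Toy batch: matched pairs should give a lower loss than swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the loss is applied to the unimodal embeddings only, which is the point of aligning first: the two encoders are pulled into a shared space before any cross-modal attention layer consumes them.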

momentum distillation, name change, vision and language representation learning

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (0.59)
  - Artificial Intelligence
    - Natural Language (0.82)
    - Machine Learning (0.56)