Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Apr-25-2026, 21:26:27 GMT–Neural Information Processing Systems

Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method does not require bounding box annotations nor high-resolution images.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Apr-25-2026, 21:26:27 GMT

Conferences PDF

Add feedback

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Natural Language > Text Processing (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.48)

Duplicate Docs Excel Report

Title
Alignbefore Fuse: Visionand Language Representation Learningwith Momentum Distillation

Similar Docs Excel Report more

Title	Similarity	Source
None found