Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Neural Information Processing Systems

ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks, including image-text retrieval, visual question answering (VQA), and NLVR2.