Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Neural Information Processing Systems 

ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
