Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
