Supplementary Material: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Neural Information Processing Systems 

We perform finetuning with image-text contrastive and image-text matching losses. During inference, VLMO is first used as a dual encoder to obtain top-k candidates, then the model is used as a fusion encoder to rerank the candidates. For the text-only pre-training data, we use English Wikipedia and BookCorpus [5].

Table 1: Ablation study of the shared self-attention module used in Multiway Transformer.
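The two-stage retrieval inference described above can be sketched as follows. This is a minimal illustration, not the released implementation: the helper name `retrieve`, the `model.itm_score` matching head, and the precomputed `image_feats`/`text_feats` embeddings are all assumptions introduced here for clarity.

```python
import torch

@torch.no_grad()
def retrieve(model, image_feats, text_feats, images, texts, k=128):
    """Hypothetical two-stage image-to-text retrieval sketch.

    Stage 1: dual-encoder scoring -- similarity between precomputed image and
    text embeddings selects the top-k candidate texts per image.
    Stage 2: fusion-encoder reranking -- each (image, candidate text) pair is
    jointly encoded and scored with an image-text matching (ITM) head.
    """
    # Stage 1: dual encoder. Embeddings are assumed L2-normalized, so the
    # dot product gives the contrastive similarity used for candidate selection.
    sim = image_feats @ text_feats.t()            # [num_images, num_texts]
    topk_sim, topk_idx = sim.topk(k, dim=-1)      # top-k candidate texts per image

    ranked = []
    for i, cand_idx in enumerate(topk_idx):
        # Stage 2: fusion encoder. `model.itm_score` stands in for the
        # cross-modal matching head and is assumed to return one logit per pair.
        scores = torch.stack([
            model.itm_score(images[i], texts[j]) for j in cand_idx
        ])
        # Rerank the shortlisted candidates by the fusion-encoder matching score.
        order = scores.argsort(descending=True)
        ranked.append(cand_idx[order])
    return ranked
```

Restricting the expensive fusion-encoder pass to the k dual-encoder candidates keeps inference cost close to that of a pure dual encoder while recovering most of the reranking accuracy.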