Supplementary Material: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
–Neural Information Processing Systems
We perform fine-tuning with image-text contrastive and image-text matching losses. During inference, VLMo is first used as a dual encoder to obtain top-k candidates; then the model is used as a fusion encoder to rerank the candidates. For the text-only pre-training data, we use English Wikipedia and BookCorpus [5].

Table 1: Ablation study of the shared self-attention module used in Multiway Transformer.
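The two-stage inference above can be sketched as follows. This is a minimal illustration, not the actual VLMo implementation: the embeddings and the fusion-encoder scoring function are stand-ins (here plain NumPy arrays and a caller-supplied `score_fn`), where in practice both would come from the pre-trained model.

```python
import numpy as np

def dual_encoder_topk(query_emb, cand_embs, k):
    """Stage 1: cheap dual-encoder retrieval.

    Ranks candidates by cosine similarity between independently
    encoded query and candidate embeddings, returning the top-k indices.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

def fusion_rerank(query_id, cand_ids, score_fn):
    """Stage 2: expensive fusion-encoder reranking.

    score_fn(query_id, cand_id) stands in for the image-text matching
    score produced by running the query-candidate pair jointly through
    the fusion encoder; only the top-k candidates are scored this way.
    """
    scores = np.array([score_fn(query_id, c) for c in cand_ids])
    order = np.argsort(-scores)
    return [cand_ids[i] for i in order]

# Toy usage: three candidate embeddings, retrieve top-2, then rerank.
query = np.array([1.0, 0.0])
cands = np.array([[0.0, 1.0], [1.0, 0.1], [0.9, 0.0]])
topk = dual_encoder_topk(query, cands, k=2)
reranked = fusion_rerank(0, list(topk), lambda q, c: {2: 0.3, 1: 0.9}[c])
```

The design point is the cost split: the dual encoder scores all candidates with a single dot product each, while the quadratic-cost fusion encoder only runs on the k survivors.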