Contrastive Language-Image Pre-Training with Knowledge Graphs - Supplementary Material

Neural Information Processing Systems

In this way, the modality of the concept can differ across triplets or training batches, and the triplet forms can include (image/text, relation, image/text). The nodes are presented in a bounding box, and the edges are represented by word tokens, e.g., "standing on". For each input modality in the training data, we adopt a unified processing procedure to make batch training possible. Specifically, the image token sequence length is set to 16x16 and the text token length is set to 77. The VE task is similar to VQA, in that it also takes an image-text pair as input.
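The unified processing above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the patch size of 14 (giving a 224x224 input for a 16x16 patch grid) and the padding token id are assumptions, while the 77-token text length and 16x16 image grid come from the text.

```python
import numpy as np

TEXT_LEN = 77   # fixed text token length (from the paper)
IMG_GRID = 16   # image split into a 16x16 grid of patch tokens (from the paper)
PATCH = 14      # hypothetical patch size -> 224x224 input image

def pad_text(token_ids, pad_id=0):
    """Truncate or pad a token-id sequence to the fixed length of 77."""
    ids = list(token_ids)[:TEXT_LEN]
    return ids + [pad_id] * (TEXT_LEN - len(ids))

def patchify(image):
    """Split an (H, W, C) image into a 16x16 grid of flattened patches,
    yielding a sequence of 256 image tokens."""
    h, w, c = image.shape
    assert h == w == IMG_GRID * PATCH, "image must match the patch grid"
    patches = image.reshape(IMG_GRID, PATCH, IMG_GRID, PATCH, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(IMG_GRID * IMG_GRID, -1)
    return patches  # shape: (256, PATCH * PATCH * C)

# Both modalities now have fixed-length token sequences, enabling batching.
text_tokens = pad_text([101, 2054, 2003, 102])
img_tokens = patchify(np.zeros((224, 224, 3), dtype=np.float32))
print(len(text_tokens), img_tokens.shape)  # 77 (256, 588)
```

Fixing both sequence lengths in this way is what allows heterogeneous triplets (image or text at either end) to be stacked into a single training batch.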