792dd774336314c3c27a04bb260cf2cf-Supplemental.pdf

Neural Information Processing Systems 

Finally,we train our model for 8hours on asingle V100GPU. We provide an illustration of our weakly supervised phrase grounding model in Figure 4b (this supplemental). Specifically,we create context-preserving negativecaptions for an image by substituting anoun in its original caption with negativenouns, that are sampled from apretrained BERT [17] model. Forexample,inthecase where only one cross-attention layer is used, adding the sentence-level contrastive loss leads to a 2.5%intheR@1accuracy. These videos contain transcribed narrations thatareeither uploaded manually byusersor aretheoutputofanautomatic speech recognition (ASR) system.

Duplicate Docs Excel Report

Similar Docs  Excel Report  more

TitleSimilaritySource
None found