792dd774336314c3c27a04bb260cf2cf-Supplemental.pdf

Feb-9-2026, 11:13:45 GMT–Neural Information Processing Systems

Finally, we train our model for 8 hours on a single V100 GPU. We provide an illustration of our weakly supervised phrase grounding model in Figure 4b (this supplemental). Specifically, we create context-preserving negative captions for an image by substituting a noun in its original caption with negative nouns that are sampled from a pretrained BERT [17] model. For example, in the case where only one cross-attention layer is used, adding the sentence-level contrastive loss leads to a 2.5% in the R@1 accuracy. These videos contain transcribed narrations that are either uploaded manually by users or are the output of an automatic speech recognition (ASR) system.
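The noun-substitution step described above can be illustrated with a minimal sketch. The excerpt only states that a noun in the original caption is replaced with negative nouns sampled from a pretrained BERT model; everything else here is an assumption made for illustration. The names `make_negative_captions` and `toy_propose`, the `max_negatives` cap, and the duplicate filter are hypothetical, and `toy_propose` is a hard-coded stand-in for the masked-language-model sampler rather than a real BERT call.

```python
from typing import Callable, List, Sequence


def make_negative_captions(
    tokens: Sequence[str],
    noun_index: int,
    propose: Callable[[Sequence[str], int], List[str]],
    max_negatives: int = 3,
) -> List[str]:
    """Swap the noun at `noun_index` for proposed substitutes.

    All other tokens are left untouched, so each negative caption
    keeps the context of the original one.
    """
    original = tokens[noun_index].lower()
    seen = {original}  # never reuse the original noun as a "negative"
    negatives: List[str] = []
    for candidate in propose(tokens, noun_index):
        word = candidate.lower()
        if word in seen:
            continue  # skip the original noun and repeated candidates
        seen.add(word)
        swapped = list(tokens)
        swapped[noun_index] = candidate
        negatives.append(" ".join(swapped))
        if len(negatives) == max_negatives:
            break
    return negatives


def toy_propose(tokens: Sequence[str], noun_index: int) -> List[str]:
    # Hypothetical stand-in for sampling from a pretrained masked
    # language model; a real proposer would mask tokens[noun_index]
    # and sample replacement words given the rest of the caption.
    return ["dog", "cat", "horse", "cat", "bird"]


caption = "a dog runs on the beach".split()
negatives = make_negative_captions(caption, noun_index=1, propose=toy_propose)
for text in negatives:
    print(text)
```

Passing the proposer in as a function keeps the substitution logic independent of any particular language model, so the toy list can be swapped for a real sampler without changing the rest of the sketch.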

artificial intelligence, machine learning, natural language


Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.52)
  - Machine Learning (0.47)
