Appendix: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

Neural Information Processing Systems 

For VaTeX captioning and retrieval, we use the latest v1.1 version, which contains 25,991 videos for training and 6,000 videos for public testing. The statistics can be found in Table 1. Visual Genome synsets are key-value pairs, where the keys are noisy natural-language phrases and the values are the mapped WordNet synsets [6]. If a visual token occurs in multiple frames, we use the averaged frame index as its temporal indicator. Specifically, for UniVL, we set the number of epochs to 50 and the linear warmup steps to 40.
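The averaged-frame-index temporal indicator can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `temporal_indicators` and the toy token sets are hypothetical.

```python
from collections import defaultdict

def temporal_indicators(frame_tokens):
    """Map each visual token to the average of the frame indices
    in which it occurs (its temporal indicator).

    frame_tokens: sequence where frame_tokens[i] is the set of
    visual tokens detected in frame i.
    """
    positions = defaultdict(list)
    for frame_idx, tokens in enumerate(frame_tokens):
        for tok in tokens:
            positions[tok].append(frame_idx)
    # A token seen in several frames gets the mean of those indices.
    return {tok: sum(idxs) / len(idxs) for tok, idxs in positions.items()}

# Hypothetical example: "dog" appears in frames 0 and 2, so its
# indicator is 1.0; "park" appears in frames 0 and 1, giving 0.5.
frames = [{"dog", "park"}, {"park"}, {"dog"}]
print(temporal_indicators(frames))
```

A single averaged index keeps the prompt compact: each token carries one scalar position rather than the full list of frames in which it was detected.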
