bd3611971089d466ab4ca96a20f7ab13-Supplemental-Datasets_and_Benchmarks.pdf

Feb-11-2026, 16:13:05 GMT–Neural Information Processing Systems

B.1 Applying ViLT to Multi-Choice Tasks

B.1.1 Applying ViLT to VCR

The VCR task provides object boxes, with each box corresponding to a grounded entity in the question. We use consistent mappings between the box colors and object names; for example, the [person1] object is always referenced with a green box in the image, and the name Casey in the text. During training and inference, each possible answer a_i is paired with the question q, to form a sequence "[CLS] q [SEP] a_i". For vision-only tasks, we found that simply using "This is an image." We also conduct ablation studies that include two baselines: (1) not inputting any image to ViLT at all, and (2) inputting the zero-vector image instead of the average image of the COCO dataset.
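The tag-to-name mapping and the question-answer pairing described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `replace_tags` and `build_candidates` are hypothetical, and only the [person1] entry (green box, name Casey) comes from the text; the second mapping entry is a made-up placeholder.

```python
import re

# Fixed mapping from object tag to (box color, name).
# Only "person1" -> ("green", "Casey") is stated in the text;
# the "person2" entry is a hypothetical placeholder for illustration.
TAG_STYLE = {
    "person1": ("green", "Casey"),
    "person2": ("red", "Riley"),  # assumption, not from the source
}


def replace_tags(text):
    """Replace each known [tag] with its fixed name; leave unknown tags as-is."""
    def swap(match):
        tag = match.group(1)
        if tag in TAG_STYLE:
            return TAG_STYLE[tag][1]
        return match.group(0)

    return re.sub(r"\[(\w+)\]", swap, text)


def build_candidates(question, answers):
    """Pair the question q with every answer a_i as "[CLS] q [SEP] a_i"."""
    q = replace_tags(question)
    return [f"[CLS] {q} [SEP] {replace_tags(a)}" for a in answers]


candidates = build_candidates(
    "Why is [person1] smiling?",
    ["[person1] is happy.", "[person2] left."],
)
for sequence in candidates:
    print(sequence)
```

Each candidate sequence would then be fed to the model together with the image, one per possible answer. How the candidates are scored and compared is not covered by this sketch.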




