bd3611971089d466ab4ca96a20f7ab13-Supplemental-Datasets_and_Benchmarks.pdf
–Neural Information Processing Systems
B.1 Applying ViLT to Multi-Choice Tasks
B.1.1 Applying ViLT to VCR
The VCR task provides object boxes, with each box corresponding to a grounded entity in the question. We use consistent mappings between the box colors and object names; for example, the [person1] object is always referenced with a green box in the image, and the name Casey in the text. During training and inference, each possible answer a_i is paired with the question q, to form a sequence "[CLS] q [SEP] a_i". For vision-only tasks, we found that simply using "This is an image." We also conduct ablation studies that include two baselines: (1) not inputting any image to ViLT at all, and (2) inputting the zero-vector image instead of the average image of the COCO dataset.
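The question-answer pairing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `replace_tags` and `build_sequences`, the `[person2]` mapping, and the sample question and answers are all hypothetical. Only the `[person1]` to green box and "Casey" pairing and the "[CLS] q [SEP] a_i" layout come from the text. The special tokens are written as literal strings for readability; a real tokenizer would typically insert them itself.

```python
import re

# Only the [person1] -> (green box, "Casey") pairing comes from the text.
# The [person2] entry is a hypothetical placeholder for illustration.
ENTITY_MAP = {
    "person1": {"color": "green", "name": "Casey"},
    "person2": {"color": "red", "name": "Riley"},  # hypothetical
}

# Matches bracketed entity tags such as [person1].
TAG = re.compile(r"\[(\w+)\]")


def replace_tags(text, entity_map=ENTITY_MAP):
    """Swap each [tag] for its fixed name; unknown tags are left unchanged."""
    def substitute(match):
        entry = entity_map.get(match.group(1))
        return entry["name"] if entry else match.group(0)
    return TAG.sub(substitute, text)


def build_sequences(question, answers, entity_map=ENTITY_MAP):
    """Return one "[CLS] q [SEP] a_i" string per candidate answer."""
    q = replace_tags(question, entity_map)
    return [
        f"[CLS] {q} [SEP] {replace_tags(answer, entity_map)}"
        for answer in answers
    ]


# Hypothetical VCR-style example: one question, two candidate answers.
sequences = build_sequences(
    "Why is [person1] smiling?",
    ["[person1] just heard a joke.", "[person2] is leaving."],
)
for sequence in sequences:
    print(sequence)
```

Because the mapping is fixed across examples, the same entity tag always resolves to the same name in the text, mirroring the fixed box color used for that entity in the image.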
Feb-11-2026, 16:13:05 GMT