A Multimodal Task Details
Table 4 shows details about the individual multimodal tasks, including the hyperparameters used to train ViLT on each task and how the low-shot version of each task is sampled. The 4 output labels in VCR are not semantically meaningful (the answer options are interchangeable); hence, instead of sampling an equal number of training examples per label, we sample a percentage of the full training data. For VQAv2, the output label space is very large and answers are not uniformly distributed across the training data, so instead of sampling N shots per output label (answer), we again sample a percentage of the full VQAv2 training data.
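As a concrete illustration of the two sampling schemes described above, the following sketch contrasts per-label N-shot sampling with percentage-based sampling. The function names, data format, and seeding are our own assumptions for illustration; the paper does not specify its sampling code.

```python
# Minimal sketch of the two low-shot sampling schemes (illustrative only).
import random
from collections import defaultdict

def sample_n_shots_per_label(examples, label_key, n_shots, seed=0):
    """Default scheme: draw n_shots training examples for each output label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    subset = []
    for exs in by_label.values():
        rng.shuffle(exs)
        subset.extend(exs[:n_shots])
    return subset

def sample_percentage(examples, fraction, seed=0):
    """VCR / VQAv2 scheme: draw a fixed fraction of the full training set,
    since per-label sampling is not meaningful for these tasks."""
    rng = random.Random(seed)
    k = int(round(fraction * len(examples)))
    return rng.sample(examples, k)
```

For example, `sample_percentage(vcr_train, 0.01)` would draw 1% of the VCR training data; the actual percentages and shot counts used per task are those reported in Table 4.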
B.1 Applying ViLT to Multi-Choice Tasks

B.1.1 Applying ViLT to VCR

The VCR task provides object boxes, with each box corresponding to a grounded entity in the question. Unlike other pre-trained vision-language encoders [Su et al., 2019, Chen et al., 2020] that use visual features extracted from regions of interest (ROIs) in the image, ViLT is designed to operate over image patches, which makes it challenging to use the object-box inputs provided by the VCR task.
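To make the mismatch concrete, the sketch below shows one way an ROI box can be related to ViLT's patch grid, by computing which patches the box overlaps. This is purely an illustration of the geometry involved (assuming ViLT's default 32x32 pixel patches and pixel-space (x1, y1, x2, y2) boxes), not the approach adopted in the paper.

```python
def box_to_patch_indices(box, image_width, image_height, patch_size=32):
    """Return flattened indices of the image patches that a pixel-space box overlaps.

    Illustrative only: shows how an ROI box relates to a patch grid, not how
    the paper actually handles VCR's object boxes.
    """
    x1, y1, x2, y2 = box
    cols = image_width // patch_size   # number of patches per row
    rows = image_height // patch_size  # number of patch rows
    c0, c1 = int(x1 // patch_size), min(int((x2 - 1) // patch_size), cols - 1)
    r0, r1 = int(y1 // patch_size), min(int((y2 - 1) // patch_size), rows - 1)
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]
```

A single box typically spans many patches (and a single patch may intersect several boxes), so there is no one-to-one correspondence between VCR's object boxes and ViLT's visual tokens, unlike with ROI-feature encoders.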