A Downstream Task Details

Neural Information Processing Systems 

Here we describe the implementation details for fine-tuning the pre-trained model. We consider two datasets for this task: COCO and Flickr30K. We follow the original dataset split with 29.8k images for training, 1k for evaluation, and 1k for test. It contains 83k images for training, 41k for validation, and 81k for test.

A quantitative comparison between ITC and ITM is shown in Table 5. Figure 7 shows the qualitative comparison between ITC and ITM for the phrases "small black bag", "the larger black suitcase", "elephant with trunk curled", and "elephant with trunk up". Grad-CAMs from the multimodal encoder capture finer-grained details such as "larger" and "curled".
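For readers unfamiliar with Grad-CAM, the following is a minimal sketch of the generic formulation: each channel of a feature map is weighted by the spatial mean of the gradient of a score with respect to that channel, the weighted channels are summed, and negative values are clipped. It is not necessarily the exact variant used to produce Figure 7. The function name `grad_cam`, the nested-list layout, and the toy inputs are assumptions made for illustration only.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Generic Grad-CAM: weight each channel by its mean gradient, sum, then ReLU.

    Hypothetical helper for illustration; inputs are per-channel 2D maps.
    """
    if len(activations) != len(gradients) or not activations:
        raise ValueError("need matching, non-empty activations and gradients")
    height, width = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for act, grad in zip(activations, gradients):
        # Channel weight: spatial average of the score's gradient for this channel.
        weight = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act[i][j]
    # ReLU keeps only locations that increase the score.
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        # Scale to [0, 1] so the map can be overlaid on an image.
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Hypothetical toy input: 2 channels over a 2x2 grid (not data from the paper).
acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
```

In practice the activations and gradients would come from a forward and backward pass through the model, with the score taken to be the quantity being visualized; the sketch leaves that step out.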
