A Downstream Task Details
Neural Information Processing Systems
Here we describe the implementation details for fine-tuning the pre-trained model. We consider two datasets for this task: COCO and Flickr30K. We follow the original dataset split with 29.8k images for training, 1k for evaluation, and 1k for test. It contains 83k images for training, 41k for validation, and 81k for test. A quantitative comparison between ITC and ITM is shown in Table 5. Figure 7 shows the qualitative comparison between ITC and ITM on the example phrases "small black bag", "the larger black suitcase", "elephant with trunk curled", and "elephant with trunk up". Grad-CAMs from the multimodal encoder capture finer-grained details such as "larger" and "curled".
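For readers unfamiliar with Grad-CAM, the following is a minimal sketch of the generic formulation: channel weights are the spatially averaged gradients, and the map is the ReLU of the weighted sum of activations. It is an illustration only, not the implementation used for Figure 7. The function name `grad_cam` and the `(channels, height, width)` array layout are assumptions made for this example, and the variant applied to the multimodal encoder may differ.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Generic Grad-CAM sketch (hypothetical helper, not from the paper).

    activations, gradients: arrays of shape (channels, height, width),
    where gradients holds d(score)/d(activations) for the target score.
    """
    # Channel weights: spatial average of the gradients for each channel.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the activation maps over channels, followed by ReLU.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)
    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The ReLU keeps only the regions that contribute positively to the target score, which is why the resulting heat map can be read as the evidence for a given phrase.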