OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Neural Information Processing Systems 

Our setup is based on the following considerations. The default settings for finetuning on each dataset are shown in Table 1.

Table 1: End-to-end finetuning configurations for image-language downstream tasks.

Config                    COCO (retrieval) & Flickr30k    COCO (captioning)    VQA
optimizer                 AdamW                           AdamW                AdamW
base learning rate        1e-5                            1e-5                 2e-5
weight decay              0.05                            0.05                 0.05
learning rate schedule    linear decay                    linear decay         linear decay
batch size                512                             512                  256
training epochs           10                              10                   10

C.2 Video-Language Tasks

We demonstrate more comparison results using different pretraining paradigms (i.e., image-only, …). Details of the pretraining data can be found in Table 4. The "img2vid" strategy is also adopted for further comparison, where we start with image-only pretraining. We can see that the captions generated by OmniVL are both natural and abundant. OmniVL can generate more fine-grained descriptions (line 1).

Figure 4: Some video captions generated by OmniVL.
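As a minimal sketch, the finetuning settings in Table 1 can be expressed as a configuration dictionary together with the linear learning-rate decay they specify. The dictionary keys and the helper function below are illustrative assumptions, not names from the OmniVL codebase; only the hyperparameter values come from Table 1.

```python
# Hypothetical encoding of the Table 1 finetuning configurations.
# Task keys and helper names are illustrative, not from OmniVL's code;
# the numeric values match Table 1.
FINETUNE_CONFIGS = {
    "coco_retrieval_flickr30k": {
        "optimizer": "AdamW", "base_lr": 1e-5, "weight_decay": 0.05,
        "lr_schedule": "linear decay", "batch_size": 512, "epochs": 10,
    },
    "coco_captioning": {
        "optimizer": "AdamW", "base_lr": 1e-5, "weight_decay": 0.05,
        "lr_schedule": "linear decay", "batch_size": 512, "epochs": 10,
    },
    "vqa": {
        "optimizer": "AdamW", "base_lr": 2e-5, "weight_decay": 0.05,
        "lr_schedule": "linear decay", "batch_size": 256, "epochs": 10,
    },
}


def linear_decay_lr(base_lr: float, epoch: int, total_epochs: int) -> float:
    """Linearly decay the learning rate from base_lr to 0 over training.

    One common reading of "linear decay"; the exact schedule (e.g. a
    warmup phase) is an assumption not specified in Table 1.
    """
    return base_lr * (1.0 - epoch / total_epochs)
```

For example, the VQA learning rate starts at 2e-5 and reaches 1e-5 halfway through its 10 epochs under this schedule.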
