Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Neural Information Processing Systems 

Vision-language (VL) pre-training has recently received considerable attention.