Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Neural Information Processing Systems

Vision-language (VL) pre-training has recently received considerable attention.