Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training
