Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
