Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection