Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training