Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training