Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner