Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training