Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark