Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training