Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition