Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment