CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations