Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering