Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces