Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
