Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
Neural Information Processing Systems
We present an approach that encodes a speech signal into a fixed-size representation by minimizing the cosine loss against the existing massively multilingual LASER text embedding space. In this space, sentences with the same meaning lie close together, independently of their language and of their modality (text or audio). Using a similarity metric in that multimodal embedding space, we mine audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl. This yields more than twenty thousand hours of aligned speech translations. To evaluate the automatically mined speech/text corpora, we train neural speech translation systems for several language pairs.
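The two ideas in the abstract, a cosine loss that pulls a speech embedding toward its text embedding and a similarity search that pairs audio with candidate sentences, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (`cosine_loss`, `mine_pairs`), the plain cosine score and the fixed `threshold` are all assumptions made for illustration; the abstract only says that "a similarity metric" is used, and the embeddings here are toy vectors rather than outputs of a speech encoder or of LASER.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def cosine_loss(speech_emb, text_emb):
    """Mean of (1 - cosine similarity) over row-aligned speech/text pairs.

    Hypothetical training objective: zero when every speech vector points
    in the same direction as its paired text vector.
    """
    s = l2_normalize(speech_emb)
    t = l2_normalize(text_emb)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))


def mine_pairs(speech_emb, text_emb, threshold=0.8):
    """For each speech vector, keep its nearest text vector if similar enough.

    Assumption: plain cosine similarity with a fixed threshold stands in
    for whatever similarity metric the paper actually uses.
    Returns a list of (speech_index, text_index, score) tuples.
    """
    s = l2_normalize(speech_emb)
    t = l2_normalize(text_emb)
    sim = s @ t.T                      # (num_speech, num_text) cosine matrix
    best = np.argmax(sim, axis=1)      # nearest text sentence per utterance
    scores = sim[np.arange(sim.shape[0]), best]
    return [
        (i, int(j), float(score))
        for i, (j, score) in enumerate(zip(best, scores))
        if score >= threshold
    ]


if __name__ == "__main__":
    # Toy "text" embeddings: four orthogonal sentences.
    text = np.eye(4)
    # Toy "speech" embeddings: noisy copies of sentences 2 and 0.
    speech = np.array([[0.0, 0.0, 1.0, 0.05],
                       [1.0, 0.05, 0.0, 0.0]])
    print(cosine_loss(speech, text[[2, 0]]))
    print(mine_pairs(speech, text))
```

At the scale described (billions of candidate sentences), a dense similarity matrix like `sim` would not fit in memory; an approximate nearest-neighbour index would be needed in its place, which this sketch does not attempt.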
May-21-2025, 22:19:05 GMT