Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents
–arXiv.org Artificial Intelligence
We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.
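To make "monotonically aligns speech segment embeddings" concrete, here is a minimal sketch of monotonic alignment by dynamic programming. It is not the authors' algorithm. It assumes each speech segment is already a fixed-size embedding vector, allows only one-to-one matches or skipped segments, and scores matches by cosine similarity. The function names and the `skip_penalty` value are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def monotonic_align(src, tgt, skip_penalty=0.3):
    """Return 1-1 index pairs (i, j), increasing in both i and j, that maximize
    total cosine similarity minus a fixed penalty per skipped segment.

    Hypothetical simplification: real systems may also merge adjacent segments.
    """
    n, m = len(src), len(tgt)
    # score[i][j]: best score aligning the first i source and first j target segments
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - skip_penalty
        back[i][0] = "skip_src"
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - skip_penalty
        back[0][j] = "skip_tgt"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + cosine(src[i - 1], tgt[j - 1]), "match"),
                (score[i - 1][j] - skip_penalty, "skip_src"),
                (score[i][j - 1] - skip_penalty, "skip_tgt"),
            ]
            # On ties, the first option in the list (a match) wins.
            score[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the end to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Toy 2-dimensional "embeddings"; the second target segment has no counterpart.
source_segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_segments = [[1.0, 0.1], [-1.0, -1.0], [0.1, 1.0], [1.0, 0.9]]
alignment = monotonic_align(source_segments, target_segments)
print(alignment)
```

Because matched pairs can never cross, one spuriously similar pair far from the diagonal cannot be selected without giving up its neighbors. This is one intuition for why a monotonic aligner might produce less noise than unconstrained similarity mining.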
Sep-24-2025
- Country:
  - Europe (1.00)
  - North America > United States > Minnesota (0.28)
  - Asia > Middle East > UAE (0.46)
- Genre:
  - Research Report > New Finding (0.68)
  - Research Report > Experimental Study (0.47)
- Industry:
  - Materials > Metals & Mining (0.59)
- Technology:
  - Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
  - Information Technology > Artificial Intelligence > Machine Learning (1.00)
  - Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)