Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents

Sep-24-2025–arXiv.org Artificial Intelligence

We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Sep-24-2025

arXiv.org PDF

Add feedback

Country:
- Europe (1.00)
- North America > United States
  - Minnesota (0.28)
- Asia > Middle East
  - UAE (0.46)

Genre:
- Research Report
  - New Finding (0.68)
  - Experimental Study (0.47)

Industry:
- Materials > Metals & Mining (0.59)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Machine Translation (1.00)
  - Machine Learning (1.00)
  - Speech > Speech Recognition (0.89)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found