Soft Alignment of Modality Space for End-to-end Speech Translation

Zhang, Yuhao, Kou, Kaiqi, Li, Bei, Xu, Chen, Zhang, Chunliang, Xiao, Tong, Zhu, Jingbo

Dec-18-2023–arXiv.org Artificial Intelligence

End-to-end Speech Translation (ST) aims to convert speech into target text within a unified model. The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To address this, we introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities. S-Align creates a modality-invariant space while preserving individual modality quality. Experiments on three languages from the MuST-C dataset show S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.

proc, representation, translation, (14 more...)

arXiv.org Artificial Intelligence

Dec-18-2023

arXiv.org PDF

Add feedback

Country:
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > China
  - Liaoning Province > Shenyang (0.04)
  - Heilongjiang Province > Harbin (0.04)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (1.00)
  - Natural Language > Machine Translation (1.00)
  - Machine Learning (1.00)