Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Zhu, Jiaxu; Tong, Weinan; Xu, Yaoxun; Song, Changhe; Wu, Zhiyong; You, Zhao; Su, Dan; Yu, Dong; Meng, Helen

arXiv.org Artificial Intelligence 

Mapping two modalities, speech and text, into a shared representation space is an active research direction for using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, speech representations and text representations differ in length. Although a previous method up-samples the text representation to align with the acoustic modality, the up-sampled lengths may not match the actual durations. In this paper, we propose a novel representation matching strategy that down-samples the acoustic representation to align with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, allowing for domain adaptation using text-only data from the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
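To make the down-sampling idea concrete, the following is a minimal sketch of the general integrate-and-fire mechanism: per-frame weights are accumulated, and one token-level vector is emitted each time the accumulated weight reaches a threshold. It is an illustration under stated assumptions, not the authors' implementation. The function name `cif_downsample`, the threshold of 1.0, and the hand-set weights are hypothetical; in a trained model the weights would be predicted by a learned layer, and each weight is assumed here to be smaller than the threshold so that a frame triggers at most one firing.

```python
import numpy as np

def cif_downsample(frames, alphas, threshold=1.0):
    """Down-sample frame-level vectors to token-level vectors.

    frames: array of shape (T, D), one vector per acoustic frame.
    alphas: array of shape (T,), non-negative weights, each assumed < threshold.
    Returns an array of shape (N, D), one vector per firing.
    """
    frames = np.asarray(frames, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    outputs = []
    acc_weight = 0.0                       # weight integrated so far
    acc_state = np.zeros(frames.shape[1])  # weighted sum integrated so far

    for h, a in zip(frames, alphas):
        if acc_weight + a < threshold:
            # Not enough weight yet: keep integrating this frame.
            acc_weight += a
            acc_state += a * h
        else:
            # Fire: use only the part of this frame's weight that
            # completes the current token, then emit the token vector.
            used = threshold - acc_weight
            outputs.append(acc_state + used * h)
            # The leftover weight starts the next token.
            acc_weight = a - used
            acc_state = acc_weight * h

    if not outputs:
        return np.zeros((0, frames.shape[1]))
    return np.stack(outputs)

# Hypothetical example: 6 frames with weight 0.5 each and threshold 1.0.
frames = np.arange(12, dtype=float).reshape(6, 2)
alphas = np.full(6, 0.5)
tokens = cif_downsample(frames, alphas)
```

With these hand-set weights, every two frames are averaged into one vector, so six frames yield three token-level vectors. Any weight left over after the last firing is discarded in this sketch; how a real system handles that tail is a design choice not covered here.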
