Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation
Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng
arXiv.org Artificial Intelligence
Mapping the two modalities, speech and text, into a shared representation space is an active research direction for using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the lengths of the speech and text representations are inconsistent. Although previous methods up-sample the text representation to align with the acoustic modality, the up-sampled representation may not match the actual duration. In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, enabling domain adaptation using text-only data from the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
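The core down-sampling mechanism can be illustrated with a minimal sketch of continuous integrate-and-fire (CIF): each encoder frame carries a learned weight, the weights are accumulated, and each time the accumulator crosses a threshold a token-level vector is "fired" as the weighted sum of the frames it spans. This is a simplified NumPy sketch of the generic CIF operation, not the authors' implementation; the function name, threshold value, and boundary-splitting details are assumptions.

```python
import numpy as np

def cif_downsample(frames, alphas, threshold=1.0):
    """Down-sample frame-level features to token-level vectors via
    continuous integrate-and-fire (CIF).

    frames: (T, D) acoustic encoder outputs
    alphas: (T,) non-negative per-frame weights (learned in practice)
    Returns an (N, D) array with one vector per firing, so the output
    length tracks the token count rather than the frame count.
    """
    tokens = []
    acc = 0.0                           # integrated weight so far
    state = np.zeros(frames.shape[1])   # weighted sum for the current token
    for h, a in zip(frames, alphas):
        if acc + a < threshold:
            # keep integrating this frame into the current token
            acc += a
            state = state + a * h
        else:
            # fire: split the frame's weight at the token boundary
            a1 = threshold - acc        # portion that completes this token
            tokens.append(state + a1 * h)
            acc = a - a1                # remainder starts the next token
            state = acc * h
    return np.stack(tokens) if tokens else np.zeros((0, frames.shape[1]))
```

With four identical frames of weight 0.5 each, the accumulator crosses 1.0 twice, so eight-dimensional frame sequences collapse to two token-level vectors; this length consistency with the text side is what enables the shared representation space described above.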
Oct-7-2023