Understanding Shared Speech-Text Representations
Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang
–arXiv.org Artificial Intelligence
Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance. In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. First, we examine the limits of speech-free domain adaptation, finding that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation.

In this work, we expand on this understanding in two directions. First, we evaluate the ability to transfer information from one domain to another through the joint representation (Section 4). We explore which components of the text encoder are robust across corpora, and which are sensitive. Second, we investigate the modal representations from the speech and text encoders (Section 5). We inspect the cross-modal consistency loss as a signal of robustness, and the ability for this loss term to generalize across corpora through t-SNE visualization of activations and a retrieval probe task.
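The two modal analyses mentioned above (t-SNE visualization of encoder activations and a retrieval probe) can be illustrated with a minimal sketch. The snippet below is not the authors' code: it uses synthetic placeholder activations (`speech_emb`, `text_emb` are hypothetical names), whereas the paper would use pooled outputs of a Maestro-style shared speech-text encoder. It shows the general shape of both probes: projecting paired speech and text embeddings into one 2-D t-SNE map, and measuring speech-to-text recall@1 by nearest-neighbor retrieval under cosine similarity.

```python
# Sketch of the two cross-modal analyses: (1) t-SNE of paired speech/text
# encoder activations, (2) a speech->text retrieval probe.
# All embeddings here are synthetic stand-ins, not real encoder outputs.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Placeholder activations: N paired utterances, d-dimensional embeddings.
N, d = 500, 256
speech_emb = rng.normal(size=(N, d))
text_emb = speech_emb + 0.1 * rng.normal(size=(N, d))  # paired, perturbed

# --- (1) t-SNE of both modalities in a single 2-D projection ---------------
joint = np.concatenate([speech_emb, text_emb], axis=0)
proj = TSNE(n_components=2, init="pca", random_state=0).fit_transform(joint)
plt.scatter(*proj[:N].T, s=5, label="speech")
plt.scatter(*proj[N:].T, s=5, label="text")
plt.legend()
plt.title("t-SNE of speech vs. text encoder activations")
plt.savefig("tsne_modalities.png")

# --- (2) retrieval probe: does each speech embedding retrieve its text? ----
def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sim = normalize(speech_emb) @ normalize(text_emb).T  # (N, N) cosine similarities
recall_at_1 = np.mean(sim.argmax(axis=1) == np.arange(N))
print(f"speech->text retrieval recall@1: {recall_at_1:.3f}")
```

Under a well-aligned shared representation, the two modalities should overlap in the t-SNE map and recall@1 should be high; a consistently low retrieval score would signal that the cross-modal consistency loss is failing to generalize to that corpus.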
Apr-27-2023
- Genre:
- Research Report (0.50)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Speech > Speech Recognition (1.00)