Coupling Speech Encoders with Downstream Text Models
Chelba, Ciprian, Schalkwyk, Johan
–arXiv.org Artificial Intelligence
Automatic speech translation (AST) modeling is usually plagued by a lack of parallel training data, which limits the success of end-to-end models. Owing to their modular architecture, cascade models for AST have the advantage of leveraging the large amounts of data available for building automatic speech recognition (ASR) and machine translation (MT) models, respectively. The straightforward way to build a cascade AST model is to send the 1-best ASR transcription to the text MT model. A further advantage of this architecture is that it is multi-modal and multi-task: besides speech, it also accepts text input for translation, and it produces ASR output either in stand-alone mode or as a by-product of the AST task. This multi-modal, multi-task view of AST is firmly anchored in the reality of practical applications, so we take it as a fundamental design choice: we aim to build a model that delivers both state-of-the-art ASR and MT performance, while optimizing AST performance within these constraints. Translating the ASR 1-best output has the obvious disadvantage that any further training (fine-tuning) on domain-specific AST parallel data cannot back-propagate the cross-entropy loss gradient through the interface between the ASR and MT models. For tighter coupling between the ASR and MT modules, we follow the approach of Dalmia et al. (2021), which leverages the 1-best ASR alignment and sends the ASR encoder embeddings aligned with the 1-best ASR sequence to the MT model. This yields a cascade architecture that allows the gradient to flow from the MT model back into the ASR components. The ASR model in our work uses a conformer encoder architecture (Gulati et al., 2020), pre-trained on a large amount of speech data as described in the Unified Speech Model (USM) work (Zhang et al., 2023).
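The coupling mechanism described above can be sketched in a few lines of PyTorch. This is a minimal toy illustration, not the paper's implementation: the module names (`CoupledCascade`, `bridge`), the tiny dimensions, and the use of a GRU encoder and a linear layer as stand-ins for the conformer ASR encoder and the full MT model are all assumptions made for brevity. The key idea it demonstrates is real, though: gathering the ASR encoder frames selected by the 1-best alignment and feeding those embeddings (rather than text) to the MT side keeps the computation graph connected, so the MT loss gradient reaches the ASR encoder.

```python
import torch
import torch.nn as nn

# Hypothetical toy dimensions; the real USM/MT models are far larger.
AUDIO_DIM, MT_DIM, VOCAB = 16, 16, 32

class CoupledCascade(nn.Module):
    """Sketch of tight ASR/MT coupling: instead of passing the 1-best ASR
    *text* across the interface, pass the ASR encoder embeddings selected
    by the 1-best alignment, so MT loss gradients reach the ASR encoder."""
    def __init__(self):
        super().__init__()
        # Stand-in for the conformer ASR encoder.
        self.asr_encoder = nn.GRU(AUDIO_DIM, AUDIO_DIM, batch_first=True)
        # Maps ASR embedding space into the MT model's input space.
        self.bridge = nn.Linear(AUDIO_DIM, MT_DIM)
        # Stand-in for the full MT model.
        self.mt_decoder = nn.Linear(MT_DIM, VOCAB)

    def forward(self, audio_frames, alignment):
        # audio_frames: (B, T, AUDIO_DIM); alignment: (B, U) frame indices
        # chosen by the 1-best ASR hypothesis (one frame per output token).
        enc, _ = self.asr_encoder(audio_frames)
        idx = alignment.unsqueeze(-1).expand(-1, -1, enc.size(-1))
        token_embeddings = enc.gather(1, idx)  # (B, U, AUDIO_DIM)
        return self.mt_decoder(self.bridge(token_embeddings))

model = CoupledCascade()
audio = torch.randn(2, 10, AUDIO_DIM)
align = torch.tensor([[0, 3, 7], [1, 4, 9]])  # hypothetical 1-best alignment
logits = model(audio, align)                  # (2, 3, VOCAB)
loss = logits.sum()                           # placeholder for MT cross-entropy
loss.backward()
# The gradient crosses the embedding interface into the ASR encoder:
grad_reaches_asr = all(p.grad is not None for p in model.asr_encoder.parameters())
```

By contrast, a 1-best *text* interface would pass discrete token ids across the boundary, which is non-differentiable, so `asr_encoder` parameters would receive no gradient from the MT loss.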
Jul-24-2024
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language > Machine Translation (1.00)
- Speech > Speech Recognition (1.00)