Coupling Speech Encoders with Downstream Text Models
Chelba, Ciprian, Schalkwyk, Johan
–arXiv.org Artificial Intelligence
Automatic speech translation (AST) modeling is usually plagued by a lack of parallel training data, which limits the success of end-to-end models. Owing to their modular architecture, cascade models for AST have the advantage of leveraging the large amounts of data available for building automatic speech recognition (ASR) and machine translation (MT) models, respectively. The straightforward way to build a cascade AST model is to send the 1-best ASR transcription to the text MT model. A further advantage of this architecture is that it is multi-modal and multi-task: besides speech, it also accepts text input for translation, and it produces ASR output either in stand-alone mode or as a by-product of the AST task. This multi-modal, multi-task view of AST is firmly anchored in the reality of practical applications, so we take it as a fundamental design choice: we aim to build a model that delivers both state-of-the-art ASR and MT performance, while optimizing AST performance within these constraints. Translating the ASR 1-best output has the obvious disadvantage that any further training (fine-tuning) on domain-specific AST parallel data cannot back-propagate the cross-entropy loss gradient through the interface between the ASR and MT models. For tighter coupling between the ASR and MT modules, we follow the approach of Dalmia et al. (2021), which leverages the 1-best ASR alignment and sends the ASR encoder embeddings aligned with the 1-best ASR sequence to the MT model. This yields a cascade architecture that allows the gradient to flow from the MT model back into the ASR components. The ASR model in our work uses a conformer encoder architecture (Gulati et al., 2020), pre-trained on a large amount of speech data as described in the Unified Speech Model (USM) work (Zhang et al., 2023).
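The coupling mechanism described above can be sketched in a few lines of PyTorch. This is a minimal toy illustration, not the paper's implementation: the module names (`CoupledCascade`, `bridge`), the tiny dimensions, and the use of a GRU encoder and a linear layer as stand-ins for the conformer ASR encoder and the full MT model are all assumptions made for brevity. The key idea it demonstrates is real, though: gathering the ASR encoder frames selected by the 1-best alignment and feeding those embeddings (rather than text) to the MT side keeps the computation graph connected, so the MT loss gradient reaches the ASR encoder.

```python
import torch
import torch.nn as nn

# Hypothetical toy dimensions; the real USM/MT models are far larger.
AUDIO_DIM, MT_DIM, VOCAB = 16, 16, 32

class CoupledCascade(nn.Module):
    """Sketch of tight ASR/MT coupling: instead of passing the 1-best ASR
    *text* across the interface, pass the ASR encoder embeddings selected
    by the 1-best alignment, so MT loss gradients reach the ASR encoder."""
    def __init__(self):
        super().__init__()
        # Stand-in for the conformer ASR encoder.
        self.asr_encoder = nn.GRU(AUDIO_DIM, AUDIO_DIM, batch_first=True)
        # Maps ASR embedding space into the MT model's input space.
        self.bridge = nn.Linear(AUDIO_DIM, MT_DIM)
        # Stand-in for the full MT model.
        self.mt_decoder = nn.Linear(MT_DIM, VOCAB)

    def forward(self, audio_frames, alignment):
        # audio_frames: (B, T, AUDIO_DIM); alignment: (B, U) frame indices
        # chosen by the 1-best ASR hypothesis (one frame per output token).
        enc, _ = self.asr_encoder(audio_frames)
        idx = alignment.unsqueeze(-1).expand(-1, -1, enc.size(-1))
        token_embeddings = enc.gather(1, idx)  # (B, U, AUDIO_DIM)
        return self.mt_decoder(self.bridge(token_embeddings))

model = CoupledCascade()
audio = torch.randn(2, 10, AUDIO_DIM)
align = torch.tensor([[0, 3, 7], [1, 4, 9]])  # hypothetical 1-best alignment
logits = model(audio, align)                  # (2, 3, VOCAB)
loss = logits.sum()                           # placeholder for MT cross-entropy
loss.backward()
# The gradient crosses the embedding interface into the ASR encoder:
grad_reaches_asr = all(p.grad is not None for p in model.asr_encoder.parameters())
```

By contrast, a 1-best *text* interface would pass discrete token ids across the boundary, which is non-differentiable, so `asr_encoder` parameters would receive no gradient from the MT loss.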
Jul-24-2024
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language > Machine Translation (1.00)
- Speech > Speech Recognition (1.00)