Multi-Modal Pre-Training for Automated Speech Recognition

Chan, David M., Ghosh, Shalini, Chakrabarty, Debmalya, Hoffmeister, Björn

Sep-15-2022–arXiv.org Artificial Intelligence

Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops or loud noises) and global-level noise (such as environmental or background noise) that has not been seen during training. In this work, we introduce a novel approach that leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by up to 7% on LibriSpeech; gains on internal datasets range from 6% (on larger models) to 45% (on smaller models).

artificial intelligence, machine learning, representation

Country:
- North America > United States > California > Alameda County > Berkeley (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (0.74)
  - Machine Learning
    - Neural Networks (0.68)
    - Inductive Learning (0.55)
