How Much Context Does My Attention-Based ASR System Need?

Oct-24-2023–arXiv.org Artificial Intelligence

For the task of speech recognition, the use of more than 30 seconds of acoustic context during training is uncommon, and under-investigated in literature. In this work, we examine the effect of scaling the sequence length used to train/evaluate (dense-attention based) acoustic and language models on speech recognition performance. For these experiments a dataset of roughly 100,000 pseudo-labelled Spotify podcasts is used, with context lengths of 5 seconds to 1 hour being explored. Zero-shot evaluations on long-format datasets Earnings-22 and Tedlium demonstrate a benefit from training with around 80 seconds of acoustic context, showing up to a 14.9% relative improvement from a limited context baseline. Furthermore, we perform a system combination with long-context transformer language models via beam search for a fully long-context ASR system, with results that are competitive with the current state-of-the-art.

context length, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

Oct-24-2023

arXiv.org PDF

Add feedback

Country:
- Europe > United Kingdom (0.14)

Genre:
- Research Report (0.64)

Industry:
- Leisure & Entertainment (0.35)
- Media (0.35)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
  - Natural Language (1.00)
  - Speech > Speech Recognition (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found