Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

Puvvada, Krishna C., Żelasko, Piotr, Huang, He, Hrinchuk, Oleksii, Koluguri, Nithin Rao, Dhawan, Kunal, Majumdar, Somshubra, Rastorgueva, Elena, Chen, Zhehuai, Lavrukhin, Vitaly, Balam, Jagadeesh, Ginsburg, Boris

Jun-28-2024–arXiv.org Artificial Intelligence

It was observed in [6] that such long utterances harm the model convergence. We also note that this Recent advances in speech recognition and translation rely on approach may lead to significant padding in mini-batches, resulting hundreds of thousands of hours of Internet speech data. We argue in wasted computation on non-informative frames. We that state-of-the art accuracy can be reached without relying on present an alternative approach to sampling and batching that web-scale data. Canary - multilingual ASR and speech translation allows us to iterate through data twice as fast, while balancing model, outperforms current state-of-the-art models - Whisper, different languages and data sources better. We further accelerate OWSM, and Seamless-M4T on English, French, Spanish, and the training and inference by adopting a FastConformer [7] architecture German languages, while being trained on an order of magnitude and initializing the encoder from a ASR only pretrained less data than these models. Three key factors enables such dataefficient checkpoint.

canary, dataset, wer, (12 more...)

arXiv.org Artificial Intelligence

Jun-28-2024

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - Minnesota > Hennepin County
    - Minneapolis (0.14)
  - California > Santa Clara County
    - Santa Clara (0.04)

Genre:
- Research Report (0.84)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (1.00)
  - Machine Learning (1.00)
  - Natural Language > Machine Translation (0.89)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found