Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

Puvvada, Krishna C., Żelasko, Piotr, Huang, He, Hrinchuk, Oleksii, Koluguri, Nithin Rao, Dhawan, Kunal, Majumdar, Somshubra, Rastorgueva, Elena, Chen, Zhehuai, Lavrukhin, Vitaly, Balam, Jagadeesh, Ginsburg, Boris

arXiv.org Artificial Intelligence 

Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms current state-of-the-art models (Whisper, OWSM, and Seamless-M4T) on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such data-efficient

It was observed in [6] that such long utterances harm the model convergence. We also note that this approach may lead to significant padding in mini-batches, resulting in wasted computation on non-informative frames. We present an alternative approach to sampling and batching that allows us to iterate through data twice as fast, while balancing different languages and data sources better. We further accelerate the training and inference by adopting a FastConformer [7] architecture and initializing the encoder from an ASR-only pretrained checkpoint.
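The remark about padding in mini-batches can be made concrete with a small sketch. It compares the fraction of padded frames when utterances are batched in arbitrary order against batching after sorting by duration. Sorting by length is a simplified stand-in chosen for illustration, not the sampling and batching method the paper proposes; the duration range, batch size, and the helper names `make_batches` and `padding_fraction` are all assumptions.

```python
import random


def make_batches(durations, batch_size):
    """Split a list of utterance durations into consecutive fixed-size batches."""
    return [durations[i:i + batch_size] for i in range(0, len(durations), batch_size)]


def padding_fraction(batches):
    """Share of the padded batch content that is padding.

    Each batch is padded to the length of its longest utterance.
    """
    padded_total = sum(len(batch) * max(batch) for batch in batches)
    real_total = sum(sum(batch) for batch in batches)
    return 1.0 - real_total / padded_total


random.seed(0)
# Hypothetical utterance durations in seconds (assumed range, not from the paper).
durations = [random.uniform(1.0, 40.0) for _ in range(1024)]

# Arbitrary order: short and long utterances share a batch.
random_waste = padding_fraction(make_batches(durations, 16))

# Length-sorted order: each batch holds utterances of similar duration.
bucketed_waste = padding_fraction(make_batches(sorted(durations), 16))

print(f"padding, arbitrary order: {random_waste:.1%}")
print(f"padding, length-sorted:   {bucketed_waste:.1%}")
```

Grouping utterances of similar length lowers the padded share, which is the wasted computation the text refers to. A practical sampler would also need to shuffle between buckets and balance languages and data sources, which this sketch does not attempt.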
