Word Level Timestamp Generation for Automatic Speech Recognition and Translation

Hu, Ke, Puvvada, Krishna, Rastorgueva, Elena, Chen, Zhehuai, Huang, He, Ding, Shuoyang, Dhawan, Kunal, Xu, Hainan, Balam, Jagadeesh, Ginsburg, Boris

arXiv.org Artificial Intelligence 

We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new <|timestamp|> token, enabling the Canary model to predict start and end timestamps for each word. Across four languages, our method achieves precision and recall rates between 80% and 90% and timestamp prediction errors ranging from 20 to 120 ms, with minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors of around 200 ms.
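To make the reported quantities concrete, the following is a minimal sketch of how precision, recall, and mean start/end errors could be computed for predicted word timestamps. The abstract does not define the matching rule, so everything here is an assumption for illustration: the `Word` type, the `timestamp_scores` function, the greedy in-order matching, and the 200 ms `collar` tolerance are hypothetical and are not taken from the paper or from NeMo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def timestamp_scores(reference, predicted, collar=0.2):
    """Score predicted word timestamps against reference timestamps.

    Assumed rule (not from the paper): a predicted word is a hit when its
    text equals a reference word and both its start and end fall within
    `collar` seconds of that reference word. Each prediction is used once.
    """
    used = set()
    start_errors, end_errors = [], []
    for ref in reference:
        for i, hyp in enumerate(predicted):
            if i in used or hyp.text != ref.text:
                continue
            start_err = abs(hyp.start - ref.start)
            end_err = abs(hyp.end - ref.end)
            if start_err <= collar and end_err <= collar:
                used.add(i)
                start_errors.append(start_err)
                end_errors.append(end_err)
                break

    hits = len(used)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "precision": hits / len(predicted) if predicted else 0.0,
        "recall": hits / len(reference) if reference else 0.0,
        "mean_start_error": mean(start_errors),
        "mean_end_error": mean(end_errors),
    }


if __name__ == "__main__":
    reference = [Word("hello", 0.00, 0.40), Word("world", 0.50, 0.90)]
    predicted = [Word("hello", 0.02, 0.44), Word("world", 1.50, 1.90)]
    print(timestamp_scores(reference, predicted))
```

In this toy input the second predicted word has the right text but lies well outside the collar, so it counts against both precision and recall. The errors are averaged over matched words only, which is one of several reasonable conventions.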
