Word Level Timestamp Generation for Automatic Speech Recognition and Translation

Hu, Ke, Puvvada, Krishna, Rastorgueva, Elena, Chen, Zhehuai, Huang, He, Ding, Shuoyang, Dhawan, Kunal, Xu, Hainan, Balam, Jagadeesh, Ginsburg, Boris

May-22-2025–arXiv.org Artificial Intelligence

We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new <|timestamp|> token, enabling the Canary model to predict start and end timestamps for each word. Our method achieves precision and recall between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages and minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors of around 200 ms.
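To make the reported metrics concrete, here is a minimal sketch of how word-level timestamp precision, recall, and mean start/end error could be computed against teacher alignments. The abstract does not define the matching rule, so everything here is an assumption for illustration: the `WordStamp` type, the `score_timestamps` function, the greedy in-order matching, and the 0.2 s `collar` are hypothetical and are not taken from the paper or from NeMo.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class WordStamp:
    """A word with start and end times in seconds (hypothetical helper type)."""
    word: str
    start: float
    end: float


def score_timestamps(
    reference: List[WordStamp],
    hypothesis: List[WordStamp],
    collar: float = 0.2,  # assumed tolerance in seconds, not a value from the paper
) -> Dict[str, Optional[float]]:
    """Greedily match hypothesis words to reference words in order.

    A hypothesis word counts as correct when a later, not yet used reference
    word has the same text and both boundaries lie within `collar` seconds.
    """
    matched = 0
    start_err_ms: List[float] = []
    end_err_ms: List[float] = []
    next_ref = 0  # reference words before this index are already consumed

    for hyp in hypothesis:
        for j in range(next_ref, len(reference)):
            ref = reference[j]
            same_word = ref.word == hyp.word
            in_collar = (
                abs(ref.start - hyp.start) <= collar
                and abs(ref.end - hyp.end) <= collar
            )
            if same_word and in_collar:
                matched += 1
                start_err_ms.append(abs(ref.start - hyp.start) * 1000.0)
                end_err_ms.append(abs(ref.end - hyp.end) * 1000.0)
                next_ref = j + 1
                break

    def mean(values: List[float]) -> Optional[float]:
        return sum(values) / len(values) if values else None

    return {
        "precision": matched / len(hypothesis) if hypothesis else 0.0,
        "recall": matched / len(reference) if reference else 0.0,
        "mean_start_error_ms": mean(start_err_ms),
        "mean_end_error_ms": mean(end_err_ms),
    }


# Example: teacher (forced-aligner style) timestamps vs. model predictions.
teacher = [WordStamp("hello", 0.0, 0.4), WordStamp("world", 0.5, 0.9)]
predicted = [WordStamp("hello", 0.02, 0.44), WordStamp("world", 0.52, 0.88)]
scores = score_timestamps(teacher, predicted)
```

Under this sketch, precision is the share of predicted words that find a match and recall is the share of reference words that are matched, so a dropped or badly timed word lowers one or both. The boundary errors are averaged over matched words only.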

artificial intelligence, machine learning, natural language

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (1.00)
  - Natural Language (1.00)
  - Machine Learning > Performance Analysis
    - Accuracy (0.36)
