Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition

Chen, Xianzhao, Lin, Yist Y., Wang, Kang, He, Yi, Ma, Zejun

arXiv.org Artificial Intelligence 

End-to-end (E2E) systems have shown comparable performance to hybrid systems for automatic speech recognition (ASR). Word timings, as a by-product of ASR, are essential in many applications, especially for subtitling and computer-aided pronunciation training. In this paper, we improve the frame-level classifier for word timings in E2E systems by introducing label priors in the connectionist temporal classification (CTC) loss, an approach adopted from prior works, and by combining low-level Mel-scale filter banks with high-level ASR encoder output as the input feature. On the internal Chinese corpus, the proposed method achieves 95.68%/94.18% ...

In E2E systems, word timings can be estimated from the forced alignment results of character-level CTC models, where the CTC peak of the first character indicates the word start time and the CTC peak of the last character indicates the word end time [9]. The CTC model cannot estimate word timings well when the duration of the modeling unit is relatively long, e.g., Chinese characters, because the blank probability of the CTC model is dominant in almost all frames and the non-blank probability is relatively high in only a few frames. This is called the peaky behavior [10]. CTC-based alignments for word timings can be improved by alleviating the peaky behavior [11, 12], but these methods have complicated regularization terms which ...
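The peak-based timing rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the blank id of 0, and the 40 ms frame shift are all assumptions chosen for illustration, and the input is assumed to be a per-frame label sequence already produced by a CTC forced alignment.

```python
from typing import List, Sequence, Tuple

BLANK = 0  # assumed blank label id (hypothetical)


def char_runs(frame_labels: Sequence[int], blank: int = BLANK) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) for each aligned character.

    In a CTC alignment a character occupies a run of identical non-blank
    frames; a repeated character must be separated by at least one blank.
    """
    runs: List[List[int]] = []
    prev = blank
    for t, lab in enumerate(frame_labels):
        if lab != blank and lab != prev:
            runs.append([t, t])      # a new character starts here
        elif lab != blank:
            runs[-1][1] = t          # the current character continues
        prev = lab
    return [(s, e) for s, e in runs]


def word_timings(
    frame_labels: Sequence[int],
    word_lengths: Sequence[int],
    frame_shift_s: float = 0.04,     # assumed frame shift in seconds
    blank: int = BLANK,
) -> List[Tuple[float, float]]:
    """Word start = first frame of its first character,
    word end = end of the last frame of its last character."""
    runs = char_runs(frame_labels, blank)
    if any(n < 1 for n in word_lengths) or sum(word_lengths) != len(runs):
        raise ValueError("word lengths do not match the aligned characters")
    timings = []
    i = 0
    for n in word_lengths:
        start_frame = runs[i][0]
        end_frame = runs[i + n - 1][1]
        timings.append((start_frame * frame_shift_s, (end_frame + 1) * frame_shift_s))
        i += n
    return timings
```

For example, `word_timings([0, 0, 5, 0, 0, 7, 7, 0, 0, 3, 0], [2, 1])` yields approximately (0.08 s, 0.28 s) for the two-character word and (0.36 s, 0.40 s) for the single-character word. The second word shows the problem the text describes: when a character is emitted on only one frame, the estimated word duration collapses to a single frame shift, however long the character was actually spoken.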
