Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition
Chen, Xianzhao, Lin, Yist Y., Wang, Kang, He, Yi, Ma, Zejun
arXiv.org Artificial Intelligence
End-to-end (E2E) systems have shown comparable performance to hybrid systems for automatic speech recognition (ASR). Word timings, as a by-product of ASR, are essential in many applications, especially for subtitling and computer-aided pronunciation training. In this paper, we improve the frame-level classifier for word timings in E2E systems by introducing label priors in the connectionist temporal classification (CTC) loss, which is adopted from prior works, and by combining low-level Mel-scale filter banks with the high-level ASR encoder output as the input feature. On the internal Chinese corpus, the proposed method achieves 95.68%/94.18%

In E2E systems, word timings can be estimated from the forced alignment results of character-level CTC models, where the CTC peak of the first character indicates the word start time and the CTC peak of the last character indicates the word end time [9]. The CTC model cannot estimate word timings well when the duration of the modeling unit is relatively long, e.g., Chinese characters, because the blank probability of the CTC model is dominant in almost all frames and the non-blank probability is relatively high in only a few frames. This is called the peaky behavior [10]. CTC-based alignments for word timings can be improved by alleviating the peaky behavior [11, 12], but these methods have complicated regularization terms which
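The peak-based timing rule described above (first-character peak as word start, last-character peak as word end) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a frame-level forced alignment is already available as a list of label ids, that the blank index is 0, and that the first frame of each non-blank run stands in for that character's CTC peak. The function names and the 0.04 s frame shift in the usage note are hypothetical.

```python
from typing import List, Tuple

BLANK = 0  # assumed blank label index; not specified in the source


def char_peak_frames(alignment: List[int], blank: int = BLANK) -> List[int]:
    """Return one frame index per emitted character.

    A character is emitted where the label is non-blank and differs from
    the previous frame's label (the usual CTC collapsing rule). The first
    frame of each run is used as a stand-in for the CTC peak.
    """
    peaks = []
    prev = blank
    for t, label in enumerate(alignment):
        if label != blank and label != prev:
            peaks.append(t)
        prev = label
    return peaks


def word_peak_frames(
    alignment: List[int], word_lengths: List[int], blank: int = BLANK
) -> List[Tuple[int, int]]:
    """Map each word to (start_frame, end_frame).

    start_frame is the peak of the word's first character and end_frame
    is the peak of its last character. word_lengths gives the number of
    characters in each word, in order.
    """
    peaks = char_peak_frames(alignment, blank)
    if len(peaks) != sum(word_lengths):
        raise ValueError("alignment and word_lengths disagree on character count")
    spans = []
    i = 0
    for n in word_lengths:
        spans.append((peaks[i], peaks[i + n - 1]))
        i += n
    return spans


def to_seconds(frame: int, frame_shift_s: float) -> float:
    """Convert a frame index to seconds for a given (assumed) frame shift."""
    return frame * frame_shift_s
```

For example, `word_peak_frames([0, 0, 3, 0, 0, 5, 0, 0, 0, 7, 0], [2, 1])` pairs the first word with the peaks of its two characters and the second word with a single peak, and `to_seconds(frame, 0.04)` would convert those frames to times under an assumed 40 ms frame shift. A single-character word gets identical start and end frames, which shows why peaky alignments carry little duration information for long modeling units such as Chinese characters.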
Jun-8-2023
- Genre:
  - Research Report (1.00)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning (1.00)
    - Natural Language (1.00)
    - Speech > Speech Recognition (1.00)