Collaborating Authors

 Huang, Kaixun


SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition

arXiv.org Artificial Intelligence

Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) has proven effective for multilingual ASR, the representations at different layers of an SSL model carry distinct information that has not been fully leveraged. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune multilingual ASR. We first analyze the layers of the SSL model for language-related and content-related information, uncovering the layers that show a stronger correlation with each. We then extract language-related frames from the correlated middle layers and guide specific content extraction through self-attention mechanisms. Additionally, we steer the model toward acquiring more content-related information in the final layers using our proposed Cross-CTC. We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that our method achieves state-of-the-art performance to the best of our knowledge.
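The idea of exploiting hierarchical SSL representations can be illustrated with a common fusion recipe: combining hidden states from several layers with softmax-normalized weights. This is a minimal, hypothetical sketch, not the paper's actual SSHR architecture; the layer values and weights are illustrative only.

```python
import math

def softmax(ws):
    """Numerically stable softmax over a list of raw layer weights."""
    m = max(ws)
    exps = [math.exp(w - m) for w in ws]
    s = sum(exps)
    return [e / s for e in exps]

def weighted_layer_sum(layer_outputs, raw_weights):
    """Fuse per-layer feature vectors (all the same length) into one
    vector, weighting each layer by its softmax-normalized weight."""
    weights = softmax(raw_weights)
    dim = len(layer_outputs[0])
    return [sum(w * layer[d] for w, layer in zip(weights, layer_outputs))
            for d in range(dim)]

# Three toy "layers" of a 2-dimensional representation.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = weighted_layer_sum(layers, [0.0, 0.0, 0.0])  # equal raw weights
```

In practice such weights are learned during fine-tuning, so the model can emphasize middle layers for language identity and final layers for content.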


Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition

arXiv.org Artificial Intelligence

By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing on such personalized words with high prediction scores can significantly degrade the performance of recognizing common words. To address this issue, we propose an adaptive contextual biasing method based on Context-Aware Transformer Transducer (CATT) that utilizes the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. Such prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios.

The introduced entity encoder enables the entity list to be personalized for individual users. However, this personalization comes at a cost: the model has less prior knowledge of the customized words, which can result in false alarms. In other words, the model may mistakenly identify non-entity names as entity terms, leading to a decrease in overall recognition performance, particularly for words that are phonemically similar. For example, if we add "José" as a context phrase, the ASR system might falsely recognize "O say can you see" as "José can you see". This issue is particularly acute for a general ASR system that is not restricted to a particular domain. As a result, this drawback makes biased models less competitive, as the benefits gained may be outweighed by the negative impact on overall performance.
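The dynamic on/off switching of the bias list can be sketched as a simple gate on the predicted occurrence probability. This is a hypothetical illustration, not the paper's implementation; the `threshold` hyperparameter and function names are assumptions.

```python
def active_bias_list(bias_list, occurrence_prob, threshold=0.5):
    """Return the bias list only when the streaming predictor says a
    contextual phrase is likely to occur; otherwise disable biasing
    so common words are not distorted by the bias terms."""
    return bias_list if occurrence_prob >= threshold else []

# When a contextual phrase is predicted, biasing stays on:
active_bias_list(["José"], occurrence_prob=0.9)   # -> ["José"]
# In a common, non-personalized utterance, biasing is switched off:
active_bias_list(["José"], occurrence_prob=0.1)   # -> []
```

The point of the gate is exactly the "José / O say can you see" failure mode: with biasing disabled in common scenarios, phonemically similar non-entity words are no longer pulled toward entity terms.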


Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network

arXiv.org Artificial Intelligence

Contextual information plays a crucial role in speech recognition technologies, and incorporating it into end-to-end speech recognition models has drawn immense interest recently. However, previous deep biasing methods lacked explicit supervision for the bias task. In this study, we introduce a contextual phrase prediction network for an attention-based deep biasing method. This network predicts the context phrases occurring in an utterance from their contextual embeddings and calculates a bias loss to assist the training of the contextualized model. Our method achieved a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and the WER on context phrases decreases by a relative 40.5%. Moreover, by applying a context phrase filtering strategy, we also effectively eliminate the WER degradation when using a larger biasing list.