AITopics | Dong, Linhao

Plotting

Dong, Linhao

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

DQ-Data2vec: Decoupling Quantization for Multilingual Speech Recognition

Shao, Qijie, Dong, Linhao, Wei, Kun, Sun, Sining, Xie, Lei

arXiv.org Artificial IntelligenceJan-23-2025

Data2vec is a self-supervised learning (SSL) approach that employs a teacher-student architecture for contextual representation learning via masked prediction, demonstrating remarkable performance in monolingual ASR. Previous studies have revealed that data2vec's shallow layers capture speaker and language information, middle layers encode phoneme and word features, while deep layers are responsible for reconstruction. Language and phoneme features are crucial for multilingual ASR. However, data2vec's masked representation generation relies on multi-layer averaging, inevitably coupling these features. To address this limitation, we propose a decoupling quantization based data2vec (DQ-Data2vec) for multilingual ASR, which includes a data2vec backbone and two improved online K-means quantizers. Our core idea is using the K-means quantizer with specified cluster numbers to decouple language and phoneme information for masked prediction. Specifically, in the language quantization, considering that the number of languages is significantly different from other irrelevant features (e.g., speakers), we assign the cluster number to match the number of languages, explicitly decoupling shallow layers' language-related information from irrelevant features. This strategy is also applied to decoupling middle layers' phoneme and word features. In a self-supervised scenario, experiments on the CommonVoice dataset demonstrate that DQ-Data2vec achieves a relative reduction of 9.51% in phoneme error rate (PER) and 11.58% in word error rate (WER) compared to data2vec and UniData2vec. Moreover, in a weakly-supervised scenario incorporating language labels and high-resource language text labels, the relative reduction is 18.09% and 1.55%, respectively.

artificial intelligence, information, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2501.13497

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training

Dong, Linhao, An, Zhecheng, Wu, Peihao, Zhang, Jun, Lu, Lu, Ma, Zejun

arXiv.org Artificial IntelligenceMay-27-2023

Speech or text representation generated by pre-trained models contains modal-specific information that could be combined for benefiting spoken language understanding (SLU) tasks. In this work, we propose a novel pre-training paradigm termed Continuous Integrate-and-Fire Pre-Training (CIF-PT). It relies on a simple but effective frame-to-token alignment: continuous integrate-and-fire (CIF) to bridge the representations between speech and text. It jointly performs speech-to-text training and language model distillation through CIF as the pre-training (PT). Evaluated on SLU benchmark SLURP dataset, CIF-PT outperforms the state-of-the-art model by 1.94% of accuracy and 2.71% of SLU-F1 on the tasks of intent classification and slot filling, respectively. We also observe the cross-modal representation extracted by CIF-PT obtains better performance than other neural interfaces for the tasks of SLU, including the dominant speech representation learned from self-supervised pre-training.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2305.17499

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

cif-based collaborative decoding for end-to-end contextual speech recognition

Han, Minglun, Dong, Linhao, Zhou, Shiyu, Xu, Bo

arXiv.org Artificial IntelligenceDec-17-2020

End-to-end (E2E) models have achieved promising results on multiple speech recognition benchmarks, and shown the potential to become the mainstream. However, the unified structure and the E2E training hamper injecting contextual information into them for contextual biasing. Though contextual LAS (CLAS) gives an excellent all-neural solution, the degree of biasing to given context information is not explicitly controllable. In this paper, we focus on incorporating context information into the continuous integrate-and-fire (CIF) based model that supports contextual biasing in a more controllable fashion. Specifically, an extra context processing network is introduced to extract contextual embeddings, integrate acoustically relevant context information and decode the contextual output distribution, thus forming a collaborative decoding with the decoder of the CIF-based model. Evaluated on the named entity rich evaluation sets of HKUST/AISHELL-2, our method brings relative character error rate (CER) reduction of 8.83%/21.13% and relative named entity character error rate (NE-CER) reduction of 40.14%/51.50% when compared with a strong baseline. Besides, it keeps the performance on original evaluation set without degradation.

artificial intelligence, speech recognition, text processing, (17 more...)

arXiv.org Artificial Intelligence

2012.09466

Country: Asia > China (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.74)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.71)

Add feedback