Guo, Pengcheng
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
Geng, Xuelong, Wei, Kun, Shao, Qijie, Liu, Shuiyun, Lin, Zhennan, Zhao, Zhixian, Li, Guojian, Tian, Wenjie, Chen, Peikun, Li, Yangze, Guo, Pengcheng, Shao, Mingchen, Wang, Shuiyuan, Cao, Yuang, Wang, Chengyou, Xu, Tianyi, Dai, Yuhang, Zhu, Xinfa, Li, Yue, Zhang, Li, Xie, Lei
Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). Using an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by optimizing ASR jointly with each target task. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.
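A minimal sketch of the ASR+X pairing described above, assuming the model emits the transcript first and then the auxiliary task output in a single sequence; the tags and separator below are illustrative assumptions, not OSUM's exact prompt format:

```python
# Toy construction of an "ASR+X" training target: the transcript is emitted
# first, followed by the target-task label. Tag names are assumptions.

def build_asr_x_target(transcript: str, task: str, task_label: str) -> str:
    """Concatenate the ASR transcript with the auxiliary task output."""
    return f"<transcribe> {transcript} <{task}> {task_label}"

# Example: speech emotion recognition (SER) paired with ASR.
target = build_asr_x_target("I really enjoyed the concert", "emotion", "happy")
print(target)  # <transcribe> I really enjoyed the concert <emotion> happy
```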
Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets
Geng, Xuelong, Xu, Tianyi, Wei, Kun, Mu, Bingshen, Xue, Hongfei, Wang, He, Li, Yangze, Guo, Pengcheng, Dai, Yuhang, Li, Longhao, Shao, Mingchen, Xie, Lei
Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve state-of-the-art (SOTA) performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs, to promote reproducible research.
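As a rough illustration of the encoder-LLM coupling studied here, the sketch below shows one common projector design, frame stacking followed by a linear mapping into the LLM embedding space; the dimensions and stacking factor are assumptions, not the paper's actual configuration:

```python
# Minimal projector sketch: reduce the frame rate of speech-encoder outputs by
# stacking adjacent frames, then project into the LLM embedding dimension.
import torch
import torch.nn as nn

class Projector(nn.Module):
    def __init__(self, enc_dim=1280, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Linear(enc_dim * stack, llm_dim)

    def forward(self, feats):                     # feats: (B, T, enc_dim)
        B, T, D = feats.shape
        T = T - T % self.stack                    # drop trailing frames
        x = feats[:, :T].reshape(B, T // self.stack, D * self.stack)
        return self.proj(x)                       # (B, T/stack, llm_dim)

speech = torch.randn(2, 100, 1280)                # fake encoder output
print(Projector()(speech).shape)                  # torch.Size([2, 25, 4096])
```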
ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge
Wang, He, Guo, Pengcheng, Li, Yue, Zhang, Ao, Sun, Jiayao, Xie, Lei, Chen, Wei, Zhou, Pan, Bu, Hui, Xu, Xin, Zhang, Binbin, Chen, Zhuo, Wu, Jian, Wang, Longbiao, Chng, Eng Siong, Li, Sun
To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours of noise for data augmentation. Two tracks, automatic speech recognition (ASR) and automatic speech diarization and recognition (ASDR), are set up, using character error rate (CER) and concatenated minimum permutation character error rate (cpCER) as evaluation metrics, respectively. Overall, the ICMC-ASR Challenge attracts 98 participating teams and receives 53 valid results across both tracks. In the end, the first-place team USTCiflytek achieves a CER of 13.16% in the ASR track and a cpCER of 21.48% in the ASDR track, absolute improvements of 13.08% and 51.4% over our challenge baselines, respectively.
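For readers unfamiliar with the ASDR metric, the toy sketch below illustrates the idea behind cpCER: concatenate each speaker's text, match hypothesis speakers to reference speakers with the lowest-cost permutation, and compute the character error rate over that best matching. Real challenge scoring handles segmentation and text normalization that this simplified version omits:

```python
# Simplified cpCER: per-speaker edit distances under the best permutation,
# normalized by the total number of reference characters.
from itertools import permutations

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cp_cer(refs, hyps):
    total_ref = sum(len(r) for r in refs)
    best = min(
        sum(edit_distance(r, hyps[i]) for r, i in zip(refs, perm))
        for perm in permutations(range(len(hyps)))
    )
    return best / total_ref

print(cp_cer(["你好世界", "今天天气"], ["今天天气", "你好世界"]))  # 0.0
```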
The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023
Wang, He, Guo, Pengcheng, Chen, Wei, Zhou, Pan, Xie, Lei
This paper describes the visual speech recognition (VSR) system submitted by NPU-ASLP-LiAuto (Team 237) to the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, covering the fixed and open tracks of the Single-Speaker VSR Task and the open track of the Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline to produce multi-scale video data. In addition, various augmentation techniques are applied during training, including speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with a joint CTC/attention loss, comprising a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. Experiments show that our system achieves 34.76% CER on the Single-Speaker Task and 41.06% CER on the Multi-Speaker Task after multi-system fusion, ranking first in all three tracks we participate in.
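A minimal sketch of the joint CTC/attention objective mentioned above, interpolating a frame-level CTC loss with the decoder's cross-entropy loss; the shapes and the 0.3 weight are illustrative assumptions rather than the system's actual settings:

```python
# Joint CTC/attention loss: weighted sum of a CTC loss over encoder outputs
# and a cross-entropy loss over autoregressive decoder outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

def joint_ctc_attention_loss(enc_logits, dec_logits, targets,
                             enc_lens, tgt_lens, ctc_weight=0.3, blank=0):
    # enc_logits: (T, B, V) frame-level logits; dec_logits: (B, L, V)
    ctc = nn.CTCLoss(blank=blank, zero_infinity=True)(
        enc_logits.log_softmax(-1), targets, enc_lens, tgt_lens)
    att = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return ctc_weight * ctc + (1 - ctc_weight) * att

V, B, T, L = 50, 2, 80, 10
targets = torch.randint(1, V, (B, L))             # no blank symbol in targets
loss = joint_ctc_attention_loss(torch.randn(T, B, V), torch.randn(B, L, V),
                                targets, torch.full((B,), T), torch.full((B,), L))
print(loss.item())
```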
MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition
Wang, He, Guo, Pengcheng, Zhou, Pan, Xie, Lei
While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of the audio/visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of our proposed system, achieving a concatenated minimum permutation character error rate (cpCER) of 30.57% on the Eval set and yielding up to 3.17% relative improvement compared with our previous system, which ranked second place in the challenge.
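To illustrate the multi-layer fusion idea, the sketch below shows one cross-attention fusion block between intermediate audio and visual features; applying such a block at several encoder depths gives the multi-layer behavior described above. The dimensions and residual formulation are assumptions, not the exact MLCA-AVSR design:

```python
# One cross-modal fusion block: each modality attends to the other and the
# result is added back as a residual connection.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):                  # (B, Ta, D), (B, Tv, D)
        a = audio + self.a2v(audio, video, video)[0]  # audio queries video
        v = video + self.v2a(video, audio, audio)[0]  # video queries audio
        return a, v

a, v = CrossModalFusion()(torch.randn(2, 120, 256), torch.randn(2, 30, 256))
print(a.shape, v.shape)  # reuse at deeper encoder layers for multi-layer fusion
```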
BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR
Liang, Yuhao, Yu, Fan, Li, Yangze, Guo, Pengcheng, Zhang, Shiliang, Chen, Qian, Xie, Lei
The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and a boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides the typical character error rate (CER), we introduce the utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to the original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.
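A minimal sketch of how an SOT target can be serialized: speaker transcriptions are ordered by start time and joined with a speaker-change token. The token name "<sc>" is an illustrative assumption:

```python
# Build a serialized output training (SOT) target from overlapping utterances.

def build_sot_target(utterances):
    """utterances: list of (start_time, text) tuples from different speakers."""
    ordered = sorted(utterances, key=lambda u: u[0])
    return " <sc> ".join(text for _, text in ordered)

mixture = [(1.2, "how are you"), (0.0, "hello there"), (2.5, "fine thanks")]
print(build_sot_target(mixture))  # hello there <sc> how are you <sc> fine thanks
```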
Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study
Chang, Xuankai, Yan, Brian, Choi, Kwanghee, Jung, Jeeweon, Lu, Yichen, Maiti, Soumi, Sharma, Roshan, Shi, Jiatong, Tian, Jinchuan, Watanabe, Shinji, Fujita, Yuya, Maekaku, Takashi, Guo, Pengcheng, Cheng, Yao-Fei, Denisov, Pavel, Saijo, Kohei, Wang, Hsiu-Hsuan
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, leading to inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model; however, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations, which significantly compress the size of speech data. Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length. Hence, training time is significantly reduced while notable performance is retained. In this study, we undertake a comprehensive and systematic exploration of the application of discrete units within end-to-end speech processing models. Experiments on 12 automatic speech recognition, 3 speech translation, and 1 spoken language understanding corpora demonstrate that discrete units achieve reasonably good results in almost all settings. We intend to release our configurations and trained models to foster future research efforts.
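A minimal sketch of the de-duplication step mentioned above, collapsing consecutive repeated cluster indices before any subword modeling; the unit values are made up for illustration:

```python
# Collapse runs of identical discrete units to shorten the input sequence.
from itertools import groupby

def deduplicate(units):
    return [u for u, _ in groupby(units)]

units = [52, 52, 52, 7, 7, 301, 301, 301, 301, 52]
print(deduplicate(units))                          # [52, 7, 301, 52]
print(len(units), "->", len(deduplicate(units)))   # 10 -> 4
```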
Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition
Xu, Tianyi, Yang, Zhanheng, Huang, Kaixun, Guo, Pengcheng, Zhang, Ao, Li, Biao, Chen, Changru, Li, Chao, Xie, Lei
By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing on such personalized words with high prediction scores can significantly degrade the performance of recognizing common words. To address this issue, we propose an adaptive contextual biasing method based on Context-Aware Transformer Transducer (CATT) that utilizes the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. Such prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios.
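A minimal sketch of the on/off switching idea, assuming a small head that predicts whether a contextual phrase is currently being spoken and gates the bias contribution accordingly; the head, threshold, and shapes are illustrative placeholders rather than the CATT-based design itself:

```python
# Gate the bias branch on a per-utterance basis using a streaming occurrence
# prediction derived from fused encoder/predictor states.
import torch
import torch.nn as nn

def adaptive_bias_step(fused_state, bias_logits, occurrence_head, threshold=0.5):
    # fused_state: (B, D) combined state at the current decoding step
    p_occur = torch.sigmoid(occurrence_head(fused_state))        # (B, 1)
    use_bias = p_occur > threshold                               # bool mask
    # Keep the bias contribution only where a context phrase is predicted.
    return torch.where(use_bias, bias_logits, torch.zeros_like(bias_logits))

occurrence_head = nn.Linear(256, 1)
out = adaptive_bias_step(torch.randn(4, 256), torch.randn(4, 10), occurrence_head)
print(out.shape)  # torch.Size([4, 10])
```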
Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network
Huang, Kaixun, Zhang, Ao, Yang, Zhanheng, Guo, Pengcheng, Mu, Bingshen, Xu, Tianyi, Xie, Lei
Contextual information plays a crucial role in speech recognition technologies, and incorporating it into end-to-end speech recognition models has drawn immense interest recently. However, previous deep biasing methods lacked explicit supervision for the bias task. In this study, we introduce a contextual phrase prediction network for an attention-based deep biasing method. This network predicts the context phrases that appear in an utterance using contextual embeddings and calculates a bias loss to assist the training of the contextualized model. Our method achieves a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and the WER of the context phrases decreases by 40.5% relative. Moreover, by applying a context phrase filtering strategy, we also effectively eliminate the WER degradation when using a larger biasing list.
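A minimal sketch of an auxiliary bias loss in the spirit described above: predict, per output token, whether it belongs to a contextual phrase, and add that loss to the main ASR objective. The labeling scheme, prediction head, and 0.5 weight are simplified assumptions, not the paper's exact design:

```python
# Auxiliary bias loss: binary prediction of "token is part of a bias phrase".
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = torch.randn(2, 10, 256)                    # (B, L, D) model states
is_context = torch.randint(0, 2, (2, 10)).float()   # 1 if token is in a bias phrase

phrase_head = nn.Linear(256, 1)
bias_logits = phrase_head(hidden).squeeze(-1)       # (B, L)
bias_loss = F.binary_cross_entropy_with_logits(bias_logits, is_context)

asr_loss = torch.tensor(1.7)                        # placeholder main ASR loss
total_loss = asr_loss + 0.5 * bias_loss             # 0.5 is an assumed weight
print(total_loss.item())
```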
TVDO: Tchebycheff Value-Decomposition Optimization for Multi-Agent Reinforcement Learning
Hu, Xiaoliang, Guo, Pengcheng, Zhou, Chuanwei, Zhang, Tong, Cui, Zhen
In cooperative multi-agent reinforcement learning (MARL), centralized training with decentralized execution (CTDE) has recently become customary due to practical deployment demands. However, the central dilemma is the inconsistency between jointly trained policies and individually optimized actions. In this work, we propose a novel value-based multi-objective learning approach, named Tchebycheff value decomposition optimization (TVDO), to overcome this dilemma. In particular, a nonlinear Tchebycheff aggregation method is designed to transform the MARL task into a multi-objective optimization counterpart by tightly constraining the upper bound of individual action-value bias. We theoretically prove that TVDO satisfies the necessary and sufficient condition of individual-global-max (IGM) with no extra limitations, which exactly guarantees the consistency between the global and individual optimal action-value functions. Empirically, on the climb and penalty games, we verify that TVDO achieves precise value factorization from the global value to individual values while guaranteeing policy consistency. Furthermore, we evaluate TVDO on the challenging StarCraft II micromanagement tasks, and extensive experiments demonstrate that TVDO achieves more competitive performance than several state-of-the-art MARL methods.
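For reference, the sketch below shows the classic Tchebycheff scalarization that the proposed aggregation builds on, reducing per-agent objectives to the worst weighted deviation from an ideal point; the weights and values are illustrative, not taken from the paper:

```python
# Tchebycheff scalarization: g(x) = max_i w_i * |f_i(x) - z_i*|, so minimizing
# g tightly bounds every individual deviation from the ideal point z*.
import numpy as np

def tchebycheff(objectives, ideal, weights):
    objectives, ideal, weights = map(np.asarray, (objectives, ideal, weights))
    return float(np.max(weights * np.abs(objectives - ideal)))

per_agent_values = [3.2, 2.8, 4.1]   # e.g., individual action-value estimates
ideal_point      = [4.0, 4.0, 4.0]   # reference (ideal) values
weights          = [1/3, 1/3, 1/3]
print(tchebycheff(per_agent_values, ideal_point, weights))  # 0.4
```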