Wang, Tianrui
Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding
Zhao, Jiahui, Shi, Hao, Cui, Chenrui, Wang, Tianrui, Liu, Hexin, Ni, Zhaoheng, Ye, Lingxuan, Wang, Longbiao
Code-switching (CS) automatic speech recognition (ASR) is challenging because of the language confusion caused by accents, auditory similarity, and seamless intra-sentence language switches. Adapting pre-trained multilingual models has shown promising performance for CS-ASR. In this paper, we adapt Whisper, a large-scale multilingual pre-trained speech recognition model, to CS on both the encoder and decoder sides. First, we propose an encoder refiner to enhance the encoder's capacity for intra-sentence switching. Second, we use two sets of language-aware adapters with different language prompt embeddings to obtain language-specific decoding information in each decoder layer. A fusion module is then added to fuse the language-aware decoding. Experimental results on the SEAME dataset show that, compared with the baseline model, the proposed approach achieves relative MER reductions of 4.1% and 7.2% on the dev_man and dev_sge test sets, respectively, surpassing state-of-the-art methods. Through experiments, we found that the proposed method significantly improves performance on the non-native language in CS speech, indicating that our approach enables Whisper to better distinguish between the two languages.
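The sketch below illustrates the idea of language-specific adapters with language prompt embeddings plus a fusion module inside a decoder layer, as described in the abstract. The module names, bottleneck size, language set, and gated-fusion form are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of "language-aware adapters + fusion" in a Whisper decoder
# layer; dimensions, language keys, and the fusion rule are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter inserted into a decoder layer."""
    def __init__(self, d_model: int, bottleneck: int = 256):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class LanguageAwareDecoding(nn.Module):
    """Two language-specific adapters (e.g. Mandarin / English), each fed a
    learned language prompt embedding, followed by a simple gated fusion."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.adapters = nn.ModuleDict({"zh": Adapter(d_model), "en": Adapter(d_model)})
        self.lang_prompt = nn.ParameterDict(
            {k: nn.Parameter(torch.zeros(1, 1, d_model)) for k in ("zh", "en")}
        )
        self.fusion_gate = nn.Linear(2 * d_model, d_model)

    def forward(self, hidden):                       # hidden: (B, T, d_model)
        outs = [self.adapters[k](hidden + self.lang_prompt[k]) for k in ("zh", "en")]
        gate = torch.sigmoid(self.fusion_gate(torch.cat(outs, dim=-1)))
        return gate * outs[0] + (1 - gate) * outs[1]  # fused language-aware states
```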
Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement
Wang, Junyu, Lin, Zizhen, Wang, Tianrui, Ge, Meng, Wang, Longbiao, Dang, Jianwu
Speech enhancement (SE) tasks aim to improve speech clarity by suppressing background noise, reverberation, and other acoustic interferences, thereby optimizing user experience and communication efficacy. In recent years, with the rapid development of deep learning, a variety of representative neural networks have emerged, especially those based on convolutional neural networks (CNN) [1]-[4], transformers [5]-[7], and U-Net architectures [8]-[10]. Generally, depending on the processing method of the input signal, these can be broadly categorized into time-domain and time-frequency (T-F) approaches. In parallel, developments in state-space models (SSM) [8], [20] present a promising alternative with linear complexity and high efficiency in handling long-sequence inputs. Mamba [21], as a novel structured SSM (S4), introduces a selective processing mechanism for input information and an efficient hardware-aware algorithm, achieving performance comparable to or exceeding Transformer-based methods across domains such as natural language, image, and audio [22]-[24]. Particularly, a recent work [25] demonstrated improved performance with reduced FLOPs by simply replacing the conformer in MP-SENet with Mamba, further validating the effectiveness of Mamba in speech processing tasks.
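As a rough illustration of the "replace the conformer with Mamba" idea mentioned above, the sketch below wraps a bidirectional Mamba scan over the time axis of a feature map, assuming the `mamba_ssm` package. The layer sizes and the bidirectional wrapper are assumptions, not the paper's actual Mamba-SEUNet block.

```python
# Minimal sketch of a bidirectional Mamba block over (B, T, C) features,
# assuming the `mamba_ssm` package; sizes are illustrative assumptions.
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class BiMambaBlock(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fwd = Mamba(d_model=d_model)               # forward-time selective scan
        self.bwd = Mamba(d_model=d_model)               # backward-time selective scan
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                               # x: (B, T, C)
        h = self.norm(x)
        y = torch.cat([self.fwd(h), self.bwd(h.flip(1)).flip(1)], dim=-1)
        return x + self.proj(y)                         # residual, linear-time in T
```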
EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis
Wang, Haoyu, Qiang, Chunyu, Wang, Tianrui, Gong, Cheng, Liu, Qiuyu, Jiang, Yu, Wang, Xiaobao, Wang, Chenyang, Zhang, Chen
Recent advancements in speech synthesis models, trained on extensive datasets, have demonstrated remarkable zero-shot capabilities. These models can control content, timbre, and emotion in generated speech based on prompt inputs. Despite these advancements, the choice of prompts significantly impacts the output quality, yet most existing selection schemes do not adequately address the control of emotional intensity. To address this issue, this paper proposes EmoPro, a two-stage prompt selection strategy specifically designed for emotionally controllable speech synthesis. The strategy selects highly expressive, high-quality prompts by evaluating them from four perspectives: emotional expression strength, speech quality, text-emotion consistency, and model generation performance. Experimental results show that prompts selected with the proposed method yield more emotionally expressive and engaging synthesized speech than those obtained through baseline selection schemes. Audio samples and code will be available at https://whyrrrrun.github.io/EmoPro/.
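A minimal sketch of a two-stage selection in the spirit of the four perspectives listed above: filter by quality, then rank by a weighted score. The scorer names, weights, and thresholds are assumptions for illustration, not the EmoPro release.

```python
# Illustrative two-stage prompt selection; field names, weights, and the quality
# floor are assumptions, not the paper's actual scoring functions.
from dataclasses import dataclass

@dataclass
class PromptScores:
    emotion_strength: float          # e.g. from an emotion-intensity classifier
    speech_quality: float            # e.g. a DNSMOS-style quality estimate
    text_emotion_consistency: float  # agreement between text and audio emotion
    generation_performance: float    # how well the TTS model reproduces the prompt

def select_prompts(candidates: dict[str, PromptScores],
                   quality_floor: float = 0.6,
                   weights=(0.4, 0.2, 0.2, 0.2),
                   top_k: int = 5) -> list[str]:
    """Stage 1: drop low-quality prompts; stage 2: rank the rest by a weighted
    sum over the four perspectives and keep the top-k prompt IDs."""
    stage1 = {pid: s for pid, s in candidates.items()
              if s.speech_quality >= quality_floor}
    def score(s: PromptScores) -> float:
        w1, w2, w3, w4 = weights
        return (w1 * s.emotion_strength + w2 * s.speech_quality
                + w3 * s.text_emotion_consistency + w4 * s.generation_performance)
    return sorted(stage1, key=lambda pid: score(stage1[pid]), reverse=True)[:top_k]
```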
On decoder-only architecture for speech-to-text and large language model integration
Wu, Jian, Gaur, Yashesh, Chen, Zhuo, Zhou, Long, Zhu, Yimeng, Wang, Tianrui, Li, Jinyu, Liu, Shujie, Ren, Bo, Liu, Linquan, Wu, Yu
Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs has not been well explored, and the "decoder-only" architecture remains understudied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification (CTC) and a simple audio encoder to map compressed acoustic features into the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
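The sketch below shows one way the described integration could look: a CTC-based module shrinks the acoustic sequence, a small projector maps it into the LLM embedding space, and the result is prepended to the text embeddings. Module sizes, the blank-removal rule, and the batch-size-1 shape handling are assumptions for illustration, not the paper's exact architecture.

```python
# Hedged sketch of mapping compressed acoustic features into an LLM's embedding
# space; shapes and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class AudioToLLMPrefix(nn.Module):
    def __init__(self, feat_dim=80, ctc_vocab=5000, llm_dim=4096, blank_id=0):
        super().__init__()
        self.ctc_encoder = nn.GRU(feat_dim, 512, num_layers=2, batch_first=True)
        self.ctc_head = nn.Linear(512, ctc_vocab)       # assumed trained with a CTC loss
        self.proj = nn.Sequential(nn.Linear(512, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))
        self.blank_id = blank_id

    def forward(self, feats):                           # feats: (1, T, feat_dim)
        h, _ = self.ctc_encoder(feats)                  # (1, T, 512)
        keep = self.ctc_head(h).argmax(-1) != self.blank_id
        h = h[keep].unsqueeze(0)                        # drop frames CTC labels as blank
        return self.proj(h)                             # (1, T', llm_dim) speech prefix

# Usage idea: prefix = AudioToLLMPrefix()(fbank), then concatenate the prefix
# with the text token embeddings along the time axis before the decoder-only LLM.
```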
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
Wang, Tianrui, Zhou, Long, Zhang, Ziqiang, Wu, Yu, Liu, Shujie, Gaur, Yashesh, Chen, Zhuo, Li, Jinyu, Wei, Furu
Recent research shows a growing convergence in model architectures, training objectives, and inference methods across tasks and modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional codec language modeling task within a multi-task learning framework. To accomplish this, we first convert all speech utterances to discrete tokens (similar to textual data) using an offline neural codec encoder. In this way, all tasks become token-based sequence conversion problems that can be naturally handled by one conditional language model. We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance its capability to handle different languages and tasks. Experimental results demonstrate that the proposed VioLA model supports both single-modal and cross-modal tasks well, and that the decoder-only model achieves comparable or even better performance than strong baselines.
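A small sketch of how task and language IDs could be prepended to form the single token stream a codec language model consumes. The special-token names, vocabulary offsets, and sequence layout are assumptions, not VioLA's actual vocabulary design.

```python
# Illustrative token-sequence construction for a decoder-only codec LM;
# vocabulary sizes, offsets, and special tokens are assumptions.
TEXT_VOCAB = 32_000          # assumed text token range [0, TEXT_VOCAB)
CODEC_VOCAB = 1_024          # assumed codec token range, offset after text tokens
SPECIAL = {"<asr>": 33_024, "<tts>": 33_025, "<s2st>": 33_026,
           "<zh>": 33_027, "<en>": 33_028, "<bos>": 33_029, "<eos>": 33_030}

def codec_to_ids(codec_tokens):
    """Offset offline neural-codec tokens so they share one vocabulary with text."""
    return [TEXT_VOCAB + t for t in codec_tokens]

def build_asr_sequence(lang: str, codec_tokens, text_ids):
    """<bos> LID TID speech-tokens text-tokens <eos>: one token-to-token task."""
    return ([SPECIAL["<bos>"], SPECIAL[f"<{lang}>"], SPECIAL["<asr>"]]
            + codec_to_ids(codec_tokens) + list(text_ids) + [SPECIAL["<eos>"]])

# Example: an English ASR training sequence for the decoder-only Transformer.
seq = build_asr_sequence("en", codec_tokens=[17, 903, 42], text_ids=[51, 209])
```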