Wang, Tianrui
Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding
Zhao, Jiahui, Shi, Hao, Cui, Chenrui, Wang, Tianrui, Liu, Hexin, Ni, Zhaoheng, Ye, Lingxuan, Wang, Longbiao
Code-switching (CS) automatic speech recognition (ASR) is challenging because of the language confusion caused by accents, auditory similarity, and seamless intra-sentence language switches. Adapting pre-trained multilingual models has shown promising performance for CS-ASR. In this paper, we adapt Whisper, a large-scale multilingual pre-trained speech recognition model, to CS on both the encoder and decoder sides. First, we propose an encoder refiner to enhance the encoder's capacity for intra-sentence switching. Second, we use two sets of language-aware adapters with different language prompt embeddings to obtain language-specific decoding information in each decoder layer. A fusion module is then added to fuse the language-aware decoding. Experimental results on the SEAME dataset show that, compared with the baseline model, the proposed approach achieves relative MER reductions of 4.1% and 7.2% on the dev_man and dev_sge test sets, respectively, surpassing state-of-the-art methods. Through experiments, we found that the proposed method significantly improves performance on the non-native language in CS speech, indicating that our approach enables Whisper to better distinguish between the two languages.
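The sketch below illustrates the idea of language-specific adapters with language prompt embeddings plus a fusion module inside a decoder layer, as described in the abstract. The module names, bottleneck size, language set, and gated-fusion form are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of "language-aware adapters + fusion" in a Whisper decoder
# layer; dimensions, language keys, and the fusion rule are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter inserted into a decoder layer."""
    def __init__(self, d_model: int, bottleneck: int = 256):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class LanguageAwareDecoding(nn.Module):
    """Two language-specific adapters (e.g. Mandarin / English), each fed a
    learned language prompt embedding, followed by a simple gated fusion."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.adapters = nn.ModuleDict({"zh": Adapter(d_model), "en": Adapter(d_model)})
        self.lang_prompt = nn.ParameterDict(
            {k: nn.Parameter(torch.zeros(1, 1, d_model)) for k in ("zh", "en")}
        )
        self.fusion_gate = nn.Linear(2 * d_model, d_model)

    def forward(self, hidden):                       # hidden: (B, T, d_model)
        outs = [self.adapters[k](hidden + self.lang_prompt[k]) for k in ("zh", "en")]
        gate = torch.sigmoid(self.fusion_gate(torch.cat(outs, dim=-1)))
        return gate * outs[0] + (1 - gate) * outs[1]  # fused language-aware states
```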
Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement
Wang, Junyu, Lin, Zizhen, Wang, Tianrui, Ge, Meng, Wang, Longbiao, Dang, Jianwu
Speech enhancement (SE) tasks aim to improve speech clarity by suppressing background noise, reverberation, and other acoustic interferences, thereby optimizing user experience and communication efficacy. In recent years, with the rapid development of deep learning, a variety of representative neural networks have emerged, especially those based on convolutional neural networks (CNN) [1]-[4], transformers [5]-[7], and U-Net architectures [8]-[10]. Generally, depending on the processing method of the input signal, these can be broadly categorized into time-domain and time-frequency (T-F) approaches. In parallel, developments in state-space models (SSM) [8], [20] present a promising alternative with linear complexity and high efficiency in handling long-sequence inputs. Mamba [21], as a novel structured SSM (S4), introduces a selective processing mechanism for input information and an efficient hardware-aware algorithm, achieving performance comparable to or exceeding Transformer-based methods across domains such as natural language, image, and audio [22]-[24]. Particularly, a recent work [25] demonstrated improved performance with reduced FLOPs by simply replacing the conformer in MP-SENet with Mamba, further validating the effectiveness of Mamba in speech processing tasks.
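As a rough illustration of the "replace the conformer with Mamba" idea mentioned above, the sketch below wraps a bidirectional Mamba scan over the time axis of a feature map, assuming the `mamba_ssm` package. The layer sizes and the bidirectional wrapper are assumptions, not the paper's actual Mamba-SEUNet block.

```python
# Minimal sketch of a bidirectional Mamba block over (B, T, C) features,
# assuming the `mamba_ssm` package; sizes are illustrative assumptions.
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class BiMambaBlock(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fwd = Mamba(d_model=d_model)               # forward-time selective scan
        self.bwd = Mamba(d_model=d_model)               # backward-time selective scan
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                               # x: (B, T, C)
        h = self.norm(x)
        y = torch.cat([self.fwd(h), self.bwd(h.flip(1)).flip(1)], dim=-1)
        return x + self.proj(y)                         # residual, linear-time in T
```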
EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis
Wang, Haoyu, Qiang, Chunyu, Wang, Tianrui, Gong, Cheng, Liu, Qiuyu, Jiang, Yu, Wang, Xiaobao, Wang, Chenyang, Zhang, Chen
Recent advancements in speech synthesis models, trained on extensive datasets, have demonstrated remarkable zero-shot capabilities. These models can control content, timbre, and emotion in generated speech based on prompt inputs. Despite these advancements, the choice of prompts significantly impacts the output quality, yet most existing selection schemes do not adequately address the control of emotional intensity. To address this issue, this paper proposes EmoPro, a two-stage prompt selection strategy specifically designed for emotionally controllable speech synthesis. The strategy selects highly expressive, high-quality prompts by evaluating them from four perspectives: emotional expression strength, speech quality, text-emotion consistency, and model generation performance. Experimental results show that prompts selected with the proposed method yield more emotionally expressive and engaging synthesized speech than those obtained through baseline selection schemes. Audio samples and code will be available at https://whyrrrrun.github.io/EmoPro/.
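A minimal sketch of a two-stage selection in the spirit of the four perspectives listed above: filter by quality, then rank by a weighted score. The scorer names, weights, and thresholds are assumptions for illustration, not the EmoPro release.

```python
# Illustrative two-stage prompt selection; field names, weights, and the quality
# floor are assumptions, not the paper's actual scoring functions.
from dataclasses import dataclass

@dataclass
class PromptScores:
    emotion_strength: float          # e.g. from an emotion-intensity classifier
    speech_quality: float            # e.g. a DNSMOS-style quality estimate
    text_emotion_consistency: float  # agreement between text and audio emotion
    generation_performance: float    # how well the TTS model reproduces the prompt

def select_prompts(candidates: dict[str, PromptScores],
                   quality_floor: float = 0.6,
                   weights=(0.4, 0.2, 0.2, 0.2),
                   top_k: int = 5) -> list[str]:
    """Stage 1: drop low-quality prompts; stage 2: rank the rest by a weighted
    sum over the four perspectives and keep the top-k prompt IDs."""
    stage1 = {pid: s for pid, s in candidates.items()
              if s.speech_quality >= quality_floor}
    def score(s: PromptScores) -> float:
        w1, w2, w3, w4 = weights
        return (w1 * s.emotion_strength + w2 * s.speech_quality
                + w3 * s.text_emotion_consistency + w4 * s.generation_performance)
    return sorted(stage1, key=lambda pid: score(stage1[pid]), reverse=True)[:top_k]
```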
On decoder-only architecture for speech-to-text and large language model integration
Wu, Jian, Gaur, Yashesh, Chen, Zhuo, Zhou, Long, Zhu, Yimeng, Wang, Tianrui, Li, Jinyu, Liu, Shujie, Ren, Bo, Liu, Linquan, Wu, Yu
Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs has not been well explored, and the "decoder-only" architecture remains understudied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification (CTC) and a simple audio encoder to map compressed acoustic features into the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
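The sketch below shows one way the described integration could look: a CTC-based module shrinks the acoustic sequence, a small projector maps it into the LLM embedding space, and the result is prepended to the text embeddings. Module sizes, the blank-removal rule, and the batch-size-1 shape handling are assumptions for illustration, not the paper's exact architecture.

```python
# Hedged sketch of mapping compressed acoustic features into an LLM's embedding
# space; shapes and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class AudioToLLMPrefix(nn.Module):
    def __init__(self, feat_dim=80, ctc_vocab=5000, llm_dim=4096, blank_id=0):
        super().__init__()
        self.ctc_encoder = nn.GRU(feat_dim, 512, num_layers=2, batch_first=True)
        self.ctc_head = nn.Linear(512, ctc_vocab)       # assumed trained with a CTC loss
        self.proj = nn.Sequential(nn.Linear(512, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))
        self.blank_id = blank_id

    def forward(self, feats):                           # feats: (1, T, feat_dim)
        h, _ = self.ctc_encoder(feats)                  # (1, T, 512)
        keep = self.ctc_head(h).argmax(-1) != self.blank_id
        h = h[keep].unsqueeze(0)                        # drop frames CTC labels as blank
        return self.proj(h)                             # (1, T', llm_dim) speech prefix

# Usage idea: prefix = AudioToLLMPrefix()(fbank), then concatenate the prefix
# with the text token embeddings along the time axis before the decoder-only LLM.
```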
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
Wang, Tianrui, Zhou, Long, Zhang, Ziqiang, Wu, Yu, Liu, Shujie, Gaur, Yashesh, Chen, Zhuo, Li, Jinyu, Wei, Furu
Recent research shows a growing convergence in model architectures, training objectives, and inference methods across tasks and modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional codec language modeling task within a multi-task learning framework. To accomplish this, we first convert all speech utterances to discrete tokens (similar to textual data) using an offline neural codec encoder. In this way, all tasks become token-based sequence conversion problems that can be naturally handled by one conditional language model. We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance its capability to handle different languages and tasks. Experimental results demonstrate that the proposed VioLA model supports both single-modal and cross-modal tasks well, and that the decoder-only model achieves comparable or even better performance than strong baselines.
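A small sketch of how task and language IDs could be prepended to form the single token stream a codec language model consumes. The special-token names, vocabulary offsets, and sequence layout are assumptions, not VioLA's actual vocabulary design.

```python
# Illustrative token-sequence construction for a decoder-only codec LM;
# vocabulary sizes, offsets, and special tokens are assumptions.
TEXT_VOCAB = 32_000          # assumed text token range [0, TEXT_VOCAB)
CODEC_VOCAB = 1_024          # assumed codec token range, offset after text tokens
SPECIAL = {"<asr>": 33_024, "<tts>": 33_025, "<s2st>": 33_026,
           "<zh>": 33_027, "<en>": 33_028, "<bos>": 33_029, "<eos>": 33_030}

def codec_to_ids(codec_tokens):
    """Offset offline neural-codec tokens so they share one vocabulary with text."""
    return [TEXT_VOCAB + t for t in codec_tokens]

def build_asr_sequence(lang: str, codec_tokens, text_ids):
    """<bos> LID TID speech-tokens text-tokens <eos>: one token-to-token task."""
    return ([SPECIAL["<bos>"], SPECIAL[f"<{lang}>"], SPECIAL["<asr>"]]
            + codec_to_ids(codec_tokens) + list(text_ids) + [SPECIAL["<eos>"]])

# Example: an English ASR training sequence for the decoder-only Transformer.
seq = build_asr_sequence("en", codec_tokens=[17, 903, 42], text_ids=[51, 209])
```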