Wang, Xinsheng
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Wang, Xinsheng, Jiang, Mingqi, Ma, Ziyang, Zhang, Ziyu, Liu, Songxiang, Li, Linqin, Liang, Zheng, Zheng, Qixi, Wang, Rui, Feng, Xiaoqin, Bian, Weizhen, Ye, Zhen, Cheng, Sitong, Yuan, Ruibin, Zhao, Zhixian, Zhu, Xinfa, Pan, Jiahao, Xue, Liumeng, Zhu, Pengcheng, Chen, Yunlin, Li, Zhifei, Chen, Xie, Xie, Lei, Guo, Yike, Xue, Wei
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.
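The abstract describes a single-stream layout in which fixed-length global (speaker) tokens and low-bitrate semantic (content) tokens are fed to an LLM, with attributes stated in a chain-of-thought order before fine-grained detail. The sketch below is purely illustrative: the tag names, sequence ordering, and helper function are assumptions, not the released Spark-TTS interface.

```python
# Minimal sketch (assumed token layout) of how decoupled, single-stream tokens
# could be assembled into an LLM prompt for controllable TTS: coarse attributes
# first, then text, then fixed-length speaker tokens, then semantic tokens.

from dataclasses import dataclass
from typing import List


@dataclass
class ControlSpec:
    gender: str           # coarse-grained attribute, e.g. "female"
    pitch_hz: float       # fine-grained attribute, e.g. 210.0
    speaking_rate: float  # fine-grained attribute, e.g. 1.1


def build_prompt(text: str,
                 control: ControlSpec,
                 global_tokens: List[int],
                 semantic_prefix: List[int]) -> str:
    """Assemble a hypothetical single-stream prompt in a CoT-like order."""
    attrs = f"<gender:{control.gender}><pitch:{control.pitch_hz}><rate:{control.speaking_rate}>"
    spk = "".join(f"<g_{t}>" for t in global_tokens)    # fixed-length speaker tokens
    sem = "".join(f"<s_{t}>" for t in semantic_prefix)  # low-bitrate content tokens
    return attrs + f"<text>{text}</text>" + spk + sem


print(build_prompt("Hello world.", ControlSpec("female", 210.0, 1.1),
                   global_tokens=[12, 7, 93], semantic_prefix=[5, 5, 18]))
```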
Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought
Zhao, Zhixian, Zhu, Xinfa, Wang, Xinsheng, Wang, Shuiyuan, Geng, Xuelong, Tian, Wenjie, Xie, Lei
Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis, and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.
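The abstract's explicit-to-implicit CoT self-distillation can be pictured as training a direct (implicit) emotion predictor against the output distribution of the step-by-step (explicit) predictor. The PyTorch sketch below only mirrors that loss structure; the placeholder logits, loss weighting, and the way a real system would obtain them are assumptions.

```python
# Minimal sketch (assumed setup) of explicit-to-implicit CoT distillation:
# cross-entropy on emotion labels plus a KL term that pulls the implicit-CoT
# student toward the explicit-CoT teacher's distribution.

import torch
import torch.nn.functional as F

num_emotions, batch = 7, 4

# Placeholders for logits a real system would produce in its two decoding modes.
teacher_logits = torch.randn(batch, num_emotions)                        # explicit CoT (content + style, then emotion)
student_logits = torch.randn(batch, num_emotions, requires_grad=True)    # implicit CoT (direct emotion prediction)
labels = torch.randint(0, num_emotions, (batch,))

ce = F.cross_entropy(student_logits, labels)
kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
              F.softmax(teacher_logits.detach(), dim=-1),
              reduction="batchmean")
loss = ce + 0.5 * kl  # the 0.5 weight is an arbitrary illustrative choice
loss.backward()
print(float(loss))
```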
Audio-FLAN: A Preliminary Release
Xue, Liumeng, Zhou, Ziya, Pan, Jiahao, Li, Zixuan, Fan, Shuai, Ma, Yinghao, Cheng, Sitong, Yang, Dongchao, Guo, Haohan, Xiao, Yujia, Wang, Xinsheng, Shen, Zixuan, Zhu, Chuanbo, Zhang, Xinshen, Liu, Tianchi, Yuan, Ruibin, Tian, Zeyue, Liu, Haohe, Benetos, Emmanouil, Zhang, Ge, Guo, Yike, Xue, Wei
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
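Because Audio-FLAN unifies understanding and generation tasks under one instruction format, a single record schema has to cover both audio-in/text-out and text-in/audio-out cases. The field names below are illustrative assumptions, not the released schema; consult the HuggingFace/GitHub release for the actual format.

```python
# Illustrative sketch (assumed field names) of unified instruction-tuning
# records spanning an understanding task and a generation task.

understanding_example = {
    "task": "speech_transcription",   # one of the dataset's ~80 task types
    "domain": "speech",               # speech / music / sound
    "instruction": "Transcribe the following audio clip.",
    "audio_input": "clip_000123.wav",
    "text_output": "the quick brown fox jumps over the lazy dog",
}

generation_example = {
    "task": "text_to_speech",
    "domain": "speech",
    "instruction": "Read the sentence aloud in a calm voice.",
    "text_input": "Good morning, everyone.",
    "audio_output": "tts_000456.wav",  # in practice, target audio tokenized for an LLM
}

for record in (understanding_example, generation_example):
    kind = "audio out" if "audio_output" in record else "text out"
    print(record["task"], "->", kind)
```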
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
Ye, Zhen, Zhu, Xinfa, Chan, Chi-Min, Wang, Xinsheng, Tan, Xu, Lei, Jiahe, Peng, Yi, Liu, Haohe, Jin, Yizhu, DAI, Zheqi, Lin, Hongzhan, Chen, Jianyi, Du, Xingjian, Xue, Liumeng, Chen, Yunlin, Li, Zhifei, Xie, Lei, Kong, Qiuqiang, Guo, Yike, Xue, Wei
Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose Llasa, a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we publicly release the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model.
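Verifier-guided inference-time scaling of the kind the abstract describes can be sketched as best-of-N sampling: draw several candidate token sequences from the TTS LLM and keep the one a speech-understanding verifier scores highest. The `sample` and `score` callables below are toy stand-ins; Llasa's actual search procedure and verifiers may differ.

```python
# Minimal best-of-N sketch (assumed search strategy) of verifier-guided
# inference-time scaling: more samples = more compute, and the chosen output
# drifts toward the verifier's preferences.

import random
from typing import Callable, List


def best_of_n(sample: Callable[[], List[int]],
              score: Callable[[List[int]], float],
              n: int) -> List[int]:
    """Draw n candidates and return the one the verifier scores highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)


# Toy stand-ins: random "speech token" sequences and a verifier that happens
# to prefer sequences with more distinct tokens.
random.seed(0)
toy_sample = lambda: [random.randrange(1024) for _ in range(random.randint(50, 80))]
toy_score = lambda seq: len(set(seq)) / len(seq)

best = best_of_n(toy_sample, toy_score, n=16)
print(len(best), round(toy_score(best), 3))
```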