AITopics | Zhao, Zhixian

Collaborating Authors

Zhao, Zhixian

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Wang, Xinsheng, Jiang, Mingqi, Ma, Ziyang, Zhang, Ziyu, Liu, Songxiang, Li, Linqin, Liang, Zheng, Zheng, Qixi, Wang, Rui, Feng, Xiaoqin, Bian, Weizhen, Ye, Zhen, Cheng, Sitong, Yuan, Ruibin, Zhao, Zhixian, Zhu, Xinfa, Pan, Jiahao, Xue, Liumeng, Zhu, Pengcheng, Chen, Yunlin, Li, Zhifei, Chen, Xie, Xie, Lei, Guo, Yike, Xue, Wei

arXiv.org Artificial IntelligenceMar-3-2025

Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.

arxiv preprint arxiv, large language model, natural language, (14 more...)

arXiv.org Artificial Intelligence

2503.0171

Country: Asia > China (0.14)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Zhao, Zhixian, Zhu, Xinfa, Wang, Xinsheng, Wang, Shuiyuan, Geng, Xuelong, Tian, Wenjie, Xie, Lei

arXiv.org Artificial IntelligenceFeb-25-2025

Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signal, performing audio analysis and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.

large language model, machine learning, ser, (18 more...)

arXiv.org Artificial Intelligence

2502.18186

Country:

Asia (1.00)
Europe > Austria > Vienna (0.14)
North America > United States > Hawaii (0.14)
(2 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science > Emotion (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

Geng, Xuelong, Wei, Kun, Shao, Qijie, Liu, Shuiyun, Lin, Zhennan, Zhao, Zhixian, Li, Guojian, Tian, Wenjie, Chen, Peikun, Li, Yangze, Guo, Pengcheng, Shao, Mingchen, Wang, Shuiyuan, Cao, Yuang, Wang, Chengyou, Xu, Tianyi, Dai, Yuhang, Zhu, Xinfa, Li, Yue, Zhang, Li, Xie, Lei

arXiv.org Artificial IntelligenceFeb-16-2025

Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SLUMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.

artificial intelligence, natural language, speech recognition, (15 more...)

arXiv.org Artificial Intelligence

2501.13306

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback