AITopics | Liu, Shansong

Collaborating Authors

Liu, Shansong

YuE: Scaling Open Foundation Models for Long-Form Music Generation

Yuan, Ruibin, Lin, Hanfeng, Guo, Shuyue, Zhang, Ge, Pan, Jiahao, Zang, Yongyi, Liu, Haohe, Liang, Yiming, Ma, Wenye, Du, Xingjian, Du, Xinrun, Ye, Zhen, Zheng, Tianyu, Ma, Yinghao, Liu, Minghao, Tian, Zeyue, Zhou, Ziya, Xue, Liumeng, Qu, Xingwei, Li, Yizhi, Wu, Shangda, Shen, Tianhao, Ma, Ziyang, Zhan, Jun, Wang, Chunhui, Wang, Yatian, Chi, Xiaowei, Zhang, Xinyue, Yang, Zhenzhu, Wang, Xiangzhou, Liu, Shansong, Mei, Lingrui, Li, Peng, Wang, Junjie, Yu, Jianwei, Pang, Guojian, Li, Xu, Wang, Zihao, Zhou, Xiaohuan, Yu, Lijun, Benetos, Emmanouil, Chen, Yong, Lin, Chenghua, Chen, Xie, Xia, Gus, Zhang, Zhaoxiang, Zhang, Chao, Chen, Wenhu, Zhou, Xinyu, Qiu, Xipeng, Dannenberg, Roger, Liu, Jiaheng, Yang, Jian, Huang, Wenhao, Xue, Wei, Tan, Xu, Guo, Yike

arXiv.org Artificial Intelligence, Mar-11-2025

We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark.

Keywords: lyrics2song, song generation, long-form, foundation model, music generation

arxiv preprint arxiv, large language model, machine learning, (21 more...)

2503.08638

Country: Asia > Japan (0.24)

Genre:

Research Report > New Finding (0.67)
Research Report > Promising Solution (0.48)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

Liu, Shansong, Hussain, Atin Sakkeer, Sun, Chenshuo, Shan, Ying

arXiv.org Artificial Intelligence, Aug-22-2023

Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.

machine learning, natural language, question answering, (17 more...)

2308.11276

Genre: Research Report (0.64)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Recent Progress in the CUHK Dysarthric Speech Recognition System

Liu, Shansong, Geng, Mengzhe, Hu, Shoukang, Xie, Xurong, Cui, Mingyu, Yu, Jianwei, Liu, Xunying, Meng, Helen

arXiv.org Artificial Intelligence, Jan-15-2022

Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date. Disordered speech presents a wide spectrum of challenges to current data-intensive, deep neural network (DNN) based ASR technologies that predominantly target normal speech. This paper presents recent research efforts at the Chinese University of Hong Kong (CUHK) to improve the performance of disordered speech recognition systems on the largest publicly available UASpeech dysarthric speech corpus. A set of novel modelling techniques, including neural architectural search, data augmentation using spectra-temporal perturbation, model-based speaker adaptation and cross-domain generation of visual features within an audio-visual speech recognition (AVSR) system framework, were employed to address the above challenges. The combination of these techniques produced the lowest published word error rate (WER) of 25.21% on the UASpeech test set of 16 dysarthric speakers, and an overall WER reduction of 5.4% absolute (17.6% relative) over the CUHK 2018 dysarthric speech recognition system featuring a 6-way DNN system combination and cross adaptation of out-of-domain normal speech data trained systems. Bayesian model adaptation further allows rapid adaptation to individual dysarthric speakers to be performed using as little as 3.06 seconds of speech. The efficacy of these techniques was further demonstrated on a CUDYS Cantonese dysarthric speech recognition task.
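The absolute and relative WER figures quoted in the abstract can be related with a short calculation. The sketch below is illustrative only: the function name `wer_reduction` is hypothetical, and the baseline WER of 30.61% is not stated in the abstract but inferred here by adding the 5.4% absolute reduction to the reported 25.21%.

```python
from typing import Tuple


def wer_reduction(baseline_wer: float, new_wer: float) -> Tuple[float, float]:
    """Return (absolute, relative) WER reduction, both in percent."""
    absolute = baseline_wer - new_wer              # percentage points
    relative = 100.0 * absolute / baseline_wer     # percent of the baseline
    return absolute, relative


# Assumption: baseline inferred from the abstract as 25.21% + 5.4% absolute.
new_wer = 25.21
baseline_wer = new_wer + 5.4

absolute, relative = wer_reduction(baseline_wer, new_wer)
print(f"{absolute:.1f}% absolute, {relative:.1f}% relative")
# prints "5.4% absolute, 17.6% relative"
```

Under that inferred baseline, the relative reduction comes out at the 17.6% quoted in the abstract.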

artificial intelligence, health & medicine, machine learning, (21 more...)

doi: 10.1109/TASLP.2021.3091805

2201.05845

Country: Asia > China > Hong Kong (0.25)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)

Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition

Geng, Mengzhe, Liu, Shansong, Yu, Jianwei, Xie, Xurong, Hu, Shoukang, Ye, Zi, Jin, Zengrui, Liu, Xunying, Meng, Helen

arXiv.org Artificial Intelligence, Jan-14-2022

Automatic recognition of disordered speech remains a highly challenging task to date. Sources of variability commonly found in normal speech, including accent, age or gender, when further compounded with the underlying causes of speech impairment and varying severity levels, create large diversity among speakers. To this end, speaker adaptation techniques play a vital role in current speech recognition systems. Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by singular value decomposition (SVD) of the speech spectrum are proposed to facilitate both accurate speech intelligibility assessment and auxiliary feature based speaker adaptation of state-of-the-art hybrid DNN and end-to-end disordered speech recognition systems. Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-Vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation. Learning hidden unit contribution (LHUC) based speaker adaptation was further applied. The final speaker adapted system using the proposed spectral basis embedding features gave an overall WER of 25.6% on the UASpeech test set of 16 dysarthric speakers.

artificial intelligence, health & medicine, machine learning, (16 more...)

doi: 10.21437/Interspeech.2021-60

2201.05554

Country: Asia > China (0.14)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Investigation of Data Augmentation Techniques for Disordered Speech Recognition

Geng, Mengzhe, Xie, Xurong, Liu, Shansong, Yu, Jianwei, Hu, Shoukang, Liu, Xunying, Meng, Helen

arXiv.org Artificial Intelligence, Jan-14-2022

Disordered speech recognition is a highly challenging task. The underlying neuro-motor conditions of people with speech disorders, often compounded with co-occurring physical disabilities, make it difficult to collect the large quantities of speech required for system development. This paper investigates a set of data augmentation techniques for disordered speech recognition, including vocal tract length perturbation (VTLP), tempo perturbation and speed perturbation. Both normal and disordered speech were exploited in the augmentation process. Variability among impaired speakers in both the original and augmented data was modeled using learning hidden unit contributions (LHUC) based speaker adaptive training. The final speaker adapted system constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation produced up to 2.92% absolute (9.3% relative) word error rate (WER) reduction over the baseline system without data augmentation, and gave an overall WER of 26.37% on the test set containing 16 dysarthric speakers.
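As a rough illustration of the speed perturbation mentioned in the abstract, the sketch below resamples a waveform by linear interpolation so that it plays faster or slower at the same sample rate. It is a minimal stand-in and not the paper's implementation: the function name `speed_perturb`, the factors 0.9 and 1.1, the 16 kHz sample rate and the 220 Hz test tone are all assumptions chosen for the example.

```python
import math


def speed_perturb(signal, factor):
    """Resample a 1-D signal so it plays `factor` times faster.

    Uses plain linear interpolation; duration and pitch both change,
    which is what distinguishes speed from tempo perturbation.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n = len(signal)
    n_out = int(round(n / factor))
    out = []
    for i in range(n_out):
        pos = min(i * factor, n - 1)   # read position, clamped to last sample
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out


# Assumed test signal: one second of a 220 Hz tone at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 220.0 * k / sample_rate) for k in range(sample_rate)]

for factor in (0.9, 1.0, 1.1):   # illustrative factors, not taken from the paper
    print(factor, len(speed_perturb(tone, factor)))
# prints lengths 17778, 16000 and 14545 respectively
```

A factor below 1 lengthens the utterance and a factor above 1 shortens it; a production system would typically use a proper band-limited resampler rather than linear interpolation.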

artificial intelligence, health & medicine, machine learning, (17 more...)

doi: 10.21437/Interspeech.2020-1161

2201.05562

Country:

Asia > China (0.14)
Europe > Sweden (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.93)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
