Collaborating Authors

Ma, Yukun


InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation

arXiv.org Artificial Intelligence

We introduce InspireMusic, a framework that integrates super resolution and a large language model for high-fidelity long-form music generation. The unified framework generates high-fidelity music, songs, and audio by combining an autoregressive transformer with a super-resolution flow-matching model, enabling controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches in that we utilize an audio tokenizer with a single codebook that contains richer semantic information, thereby reducing training costs and enhancing efficiency. This combination enables us to achieve high-quality audio generation with long-form coherence of up to 8 minutes. An autoregressive transformer model based on Qwen 2.5 predicts the audio tokens, and a super-resolution flow-matching model then generates high-sampling-rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model performs comparably to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.
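
To make the two-stage design concrete, here is a minimal structural sketch of an "autoregressive tokens, then super-resolution" pipeline in PyTorch. The tiny transformer, vocabulary size, stub waveform, and linear upsampler are stand-ins chosen for illustration; the actual system uses a Qwen 2.5-based model, a single-codebook audio tokenizer, and a flow-matching super-resolution model.

```python
# Structural sketch of a two-stage "AR tokens -> super-resolution" pipeline,
# loosely following the InspireMusic description. All module sizes, names,
# and the toy upsampler are illustrative assumptions, not the released code.
import torch
import torch.nn as nn

VOCAB = 256          # assumed size of the single-codebook audio tokenizer
CTX = 128            # assumed context length

class TinyARLM(nn.Module):
    """Stand-in for the Qwen 2.5-based autoregressive token predictor."""
    def __init__(self, vocab=VOCAB, d=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, vocab)

    def forward(self, tokens):
        n = tokens.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.body(self.emb(tokens), mask=mask)   # causal self-attention
        return self.head(h)

@torch.no_grad()
def generate_tokens(model, prompt, steps=32):
    tokens = prompt
    for _ in range(steps):
        logits = model(tokens[:, -CTX:])
        nxt = logits[:, -1].argmax(-1, keepdim=True)  # greedy for brevity
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens

def super_resolve(coarse_audio, factor=2):
    """Placeholder for the flow-matching super-resolution stage: here just a
    linear upsample to a higher sampling rate."""
    return torch.nn.functional.interpolate(
        coarse_audio.unsqueeze(1), scale_factor=factor, mode="linear").squeeze(1)

lm = TinyARLM()
prompt = torch.randint(0, VOCAB, (1, 8))          # tokens from a text/audio prompt
tokens = generate_tokens(lm, prompt)
coarse = torch.randn(1, tokens.size(1) * 160)     # decoded low-rate waveform (stub)
hires = super_resolve(coarse)                     # fine-grained high-rate output
print(tokens.shape, hires.shape)
```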


Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions

arXiv.org Artificial Intelligence

Current emotional text-to-speech (TTS) systems face challenges in mimicking a broad spectrum of human emotions due to the inherent complexity of emotions and limitations in emotional speech datasets and models. This paper proposes a TTS framework that facilitates control over pleasure, arousal, and dominance, and can synthesize a diverse range of emotional styles without requiring any emotional speech data during TTS training. We train an emotional attribute predictor using only categorical labels from speech data, aligning with psychological research and incorporating anchored dimensionality reduction on self-supervised learning (SSL) features. The TTS framework converts text inputs into phonetic tokens via an autoregressive language model and uses pseudo-emotional dimensions to guide the parallel prediction of fine-grained acoustic details. Experiments conducted on the LibriTTS dataset demonstrate that our framework can synthesize speech with enhanced naturalness and a variety of emotional styles by effectively controlling emotional dimensions, even without any emotional speech during TTS training.
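
One way to picture the control signal is as class-anchored coordinates in pleasure-arousal-dominance (PAD) space. The toy sketch below soft-assigns features to emotion categories and blends per-class PAD anchors; the anchor values, random stand-in features, and nearest-class-mean predictor are all illustrative assumptions, since the paper uses SSL speech features with anchored dimensionality reduction.

```python
# Toy sketch of deriving pseudo PAD control values from categorical emotion
# predictions. Anchor coordinates and the linear probe are assumptions.
import numpy as np

# Hypothetical PAD anchors per categorical label (values are assumptions).
ANCHORS = {
    "neutral": np.array([0.5, 0.4, 0.5]),
    "happy":   np.array([0.9, 0.7, 0.6]),
    "angry":   np.array([0.2, 0.9, 0.8]),
    "sad":     np.array([0.2, 0.3, 0.3]),
}
LABELS = list(ANCHORS)

rng = np.random.default_rng(0)
feats = rng.normal(size=(len(LABELS) * 50, 32))   # stand-in for SSL features
labels = np.repeat(np.arange(len(LABELS)), 50)

# Nearest-class-mean "predictor" trained only on categorical labels.
means = np.stack([feats[labels == i].mean(0) for i in range(len(LABELS))])

def pseudo_pad(x):
    """Soft-assign a feature vector to classes, then blend PAD anchors."""
    d = ((x - means) ** 2).sum(-1)
    w = np.exp(-d)
    w /= w.sum()
    return sum(wi * ANCHORS[LABELS[i]] for i, wi in enumerate(w))

print(pseudo_pad(feats[0]))  # 3-dim PAD vector to condition the TTS decoder
```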


Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers

arXiv.org Artificial Intelligence

The Transformer architecture has significantly advanced deep learning, particularly in natural language processing, by effectively managing long-range dependencies. However, as the demand for understanding complex relationships grows, refining the Transformer's architecture becomes critical. This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models by enabling direct attention between non-adjacent layers. This method improves the model's ability to capture dependencies between high-level abstract features and low-level details. By facilitating direct attention between these diverse feature levels, our approach overcomes the limitations of current Transformers, which often rely on suboptimal intra-layer attention. Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer, thus enhancing the diversity of multi-head attention without additional computational burden. Extensive experiments demonstrate that our enhanced Transformer model achieves superior performance in language modeling tasks, highlighting the effectiveness of our skip-layer attention mechanism.
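
The mechanism lends itself to a compact illustration. Below is a minimal single-head sketch in which queries from layer l attend to keys and values from both layer l and layer l-1; multi-head projections and causal masking are omitted for brevity, and the dimensions and weight initialization are illustrative assumptions rather than the paper's configuration.

```python
# Single-head sketch of skip-layer attention: queries in layer l attend to
# keys/values of layers l and l-1, concatenated along the sequence axis.
import torch
import torch.nn.functional as F

def skip_layer_attention(h_curr, h_prev, wq, wk, wv):
    """h_curr, h_prev: (batch, seq, d) hidden states of layers l and l-1."""
    q = h_curr @ wq                                   # queries from current layer
    k = torch.cat([h_curr @ wk, h_prev @ wk], dim=1)  # keys from layers l and l-1
    v = torch.cat([h_curr @ wv, h_prev @ wv], dim=1)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v              # (batch, seq, d)

d = 64
wq, wk, wv = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
h_prev = torch.randn(2, 16, d)   # layer l-1 output
h_curr = torch.randn(2, 16, d)   # layer l input
out = skip_layer_attention(h_curr, h_prev, wq, wk, wv)
print(out.shape)                 # torch.Size([2, 16, 64])
```

Because the extra keys and values reuse the same projections, the per-query cost grows only with the doubled attention span, which is consistent with the claim of added diversity without additional parameters per head.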


Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

arXiv.org Artificial Intelligence

Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model. Subsequently, a non-autoregressive model is employed to predict discrete acoustic codec tokens that contain fine-grained acoustic details. The TTS model focuses solely on linguistic modeling during autoregressive training, thereby reducing the error propagation that otherwise occurs when fine-grained acoustic details are predicted autoregressively. Both objective and subjective evaluations validate the effectiveness of our proposed method.
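
The division of labor between the two stages can be sketched schematically. In the toy code below, the autoregressive loss touches only the phonetic stream, while a parallel predictor maps phonetic tokens to acoustic codec tokens; the vocabulary sizes and one-layer modules are illustrative assumptions, not the paper's architecture.

```python
# Schematic sketch of the two-stage split: an autoregressive loss on
# phonetically rich tokens, plus a non-autoregressive predictor for acoustic
# codec tokens. All shapes and modules are illustrative assumptions.
import torch
import torch.nn as nn

B, T = 2, 20
N_PHONETIC, N_ACOUSTIC, D = 100, 1024, 64

phonetic = torch.randint(0, N_PHONETIC, (B, T))   # SSL-derived training targets
acoustic = torch.randint(0, N_ACOUSTIC, (B, T))   # codec tokens with fine detail

# Stage 1: AR language model trained only on the phonetic stream.
ar = nn.Sequential(nn.Embedding(N_PHONETIC, D), nn.Linear(D, N_PHONETIC))
logits = ar(phonetic[:, :-1])                     # predict next phonetic token
ar_loss = nn.functional.cross_entropy(
    logits.reshape(-1, N_PHONETIC), phonetic[:, 1:].reshape(-1))

# Stage 2: NAR model maps phonetic tokens to acoustic codec tokens in parallel.
nar = nn.Sequential(nn.Embedding(N_PHONETIC, D), nn.Linear(D, N_ACOUSTIC))
nar_logits = nar(phonetic)                        # no left-to-right dependency
nar_loss = nn.functional.cross_entropy(
    nar_logits.reshape(-1, N_ACOUSTIC), acoustic.reshape(-1))

print(float(ar_loss), float(nar_loss))
```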


LiqD: A Dynamic Liquid Level Detection Model under Tricky Small Containers

arXiv.org Artificial Intelligence

In daily life and industrial production, it is crucial to accurately detect changes in the liquid level in containers. Traditional contact measurement methods have limitations, while emerging non-contact image processing technology shows good application prospects. This paper proposes a dynamic container liquid level detection model based on U^2-Net. The model uses SAM to generate an initial data set, then evaluates and filters out high-quality pseudo-label images through the SemiReward framework to build a dedicated data set. U^2-Net extracts mask images of containers from this data set, and morphological processing compensates for mask defects. The model then computes the grayscale difference between adjacent video frames at the same position, segments the liquid level change region by applying a difference threshold, and finally uses a lightweight neural network to classify the liquid level state. This approach not only mitigates the impact of intricate surroundings but also reduces the demand for training data, showing strong robustness and versatility. Extensive experimental results show that the proposed model can effectively detect dynamic liquid level changes in containers, providing a novel and efficient solution for related fields.
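
The frame-differencing step is simple enough to sketch directly. The synthetic frames, the all-ones container mask, and the 0.3 threshold below are illustrative assumptions; in the actual pipeline the mask comes from U^2-Net and the final state classification is done by a lightweight neural network.

```python
# Sketch of the frame-differencing step: grayscale difference between adjacent
# frames inside the container mask, thresholded to localize the level change.
import numpy as np

H, W = 64, 32
frame_a = np.zeros((H, W), dtype=np.float32)
frame_b = np.zeros((H, W), dtype=np.float32)
frame_a[40:, :] = 0.8                # liquid filling the bottom 24 rows
frame_b[35:, :] = 0.8                # level has risen by 5 rows
mask = np.ones((H, W), dtype=bool)   # stand-in for the U^2-Net container mask

diff = np.abs(frame_b - frame_a) * mask
changed = diff > 0.3                 # difference threshold (assumed value)
rows = np.flatnonzero(changed.any(axis=1))
if rows.size:
    print(f"level change detected between rows {rows.min()} and {rows.max()}")
else:
    print("no level change")
```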


Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

arXiv.org Artificial Intelligence

Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld
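
The loss itself is compact. Below is a minimal sketch of a KL divergence against smoothed one-hot targets over speech tokens, in the spirit of SLD; the smoothing value is an assumption, and the released code linked above is the authoritative implementation.

```python
# Minimal sketch of a smoothed-label KL loss over speech tokens, in the
# spirit of Smoothed Label Distillation. The eps value is an assumption.
import torch
import torch.nn.functional as F

def sld_loss(logits, targets, vocab_size, eps=0.1):
    """logits: (N, vocab) predictions; targets: (N,) speech-token ids."""
    smoothed = torch.full_like(logits, eps / (vocab_size - 1))
    smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - eps)  # smoothed one-hot
    log_probs = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_probs, smoothed, reduction="batchmean")

vocab = 500
logits = torch.randn(8, vocab)
targets = torch.randint(0, vocab, (8,))
print(float(sld_loss(logits, targets, vocab)))
```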


ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers

arXiv.org Artificial Intelligence

Large Language Models (LLMs) such as ChatGPT and LLaMA encounter limitations in domain-specific tasks: they often lack depth and accuracy in specialized areas and exhibit a decrease in general capabilities when fine-tuned, particularly in the analysis ability of small-sized models. To address these gaps, we introduce ICE-GRT, which utilizes Reinforcement Learning from Human Feedback (RLHF) grounded in Proximal Policy Optimization (PPO) and demonstrates remarkable ability in in-domain scenarios without compromising general task performance. Our exploration of ICE-GRT highlights its understanding and reasoning ability to not only generate robust answers but also provide detailed analyses of the reasons behind them. This capability marks a significant progression beyond the scope of Supervised Fine-Tuning models. The success of ICE-GRT depends on several crucial factors, including Appropriate Data, Reward Size Scaling, KL-Control, and Advantage Normalization. The ICE-GRT model exhibits state-of-the-art performance in domain-specific tasks and across 12 general language tasks against LLMs of equivalent and even larger size, highlighting the effectiveness of our approach. We provide a comprehensive analysis of ICE-GRT, underscoring the significant advancements it brings to the field of LLMs.
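
Two of the listed ingredients are easy to illustrate in isolation. The sketch below shows advantage normalization and a KL-control reward penalty as they conventionally appear in PPO-based RLHF; it is a generic illustration under standard PPO conventions, not the ICE-GRT code, and the kl_coef value is an assumption.

```python
# Generic PPO-style RLHF ingredients: advantage normalization and a KL
# penalty keeping the policy close to a frozen reference model.
import torch

def normalize_advantages(adv, eps=1e-8):
    return (adv - adv.mean()) / (adv.std() + eps)

def kl_controlled_reward(reward, logp_policy, logp_ref, kl_coef=0.1):
    """Penalize per-token divergence of the policy from the reference."""
    return reward - kl_coef * (logp_policy - logp_ref)

logp_policy = torch.randn(4, 16)   # per-token log-probs under current policy
logp_ref = torch.randn(4, 16)      # per-token log-probs under frozen reference
reward = torch.randn(4, 16)        # reward-model signal spread over tokens
shaped = kl_controlled_reward(reward, logp_policy, logp_ref)
adv = normalize_advantages(shaped)  # would feed the clipped PPO loss
print(adv.mean().item(), adv.std().item())
```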


Ditto: A Simple and Efficient Approach to Improve Sentence Embeddings

arXiv.org Artificial Intelligence

Prior studies diagnose the anisotropy problem in sentence representations from pre-trained language models, e.g., BERT, without fine-tuning. Our analysis reveals that the sentence embeddings from BERT suffer from a bias towards uninformative words, limiting performance on semantic textual similarity (STS) tasks. To address this bias, we propose a simple and efficient unsupervised approach, Diagonal Attention Pooling (Ditto), which weights words with model-based importance estimations and computes the weighted average of word representations from pre-trained models as sentence embeddings. Ditto can easily be applied to any pre-trained language model as a postprocessing operation. Compared to prior sentence embedding approaches, Ditto neither adds parameters nor requires any learning. Empirical evaluations demonstrate that Ditto alleviates the anisotropy problem and improves various pre-trained models on STS tasks.
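
Because Ditto is a postprocessing operation, it reduces to a few lines. The self-contained sketch below weights each token by its diagonal self-attention value and averages the token representations; the random tensors stand in for a real model's outputs, and the choice of layer and head is arbitrary here, whereas the paper selects heads that give good importance estimates.

```python
# Self-contained sketch of Diagonal Attention Pooling: token weights come
# from the diagonal of an attention map; the sentence embedding is the
# weighted average of token representations.
import torch

def ditto_embedding(hidden, attn, head=0):
    """hidden: (seq, d) token representations from a pre-trained model;
    attn: (heads, seq, seq) self-attention probabilities of one layer."""
    w = attn[head].diagonal()               # token i attending to itself
    w = w / w.sum()                         # normalize to a distribution
    return (w.unsqueeze(1) * hidden).sum(0)  # (d,) sentence embedding

seq, d, heads = 10, 32, 4
hidden = torch.randn(seq, d)                          # stand-in hidden states
attn = torch.softmax(torch.randn(heads, seq, seq), dim=-1)
emb = ditto_embedding(hidden, attn)
print(emb.shape)   # torch.Size([32])
```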


Balancing Specialized and General Skills in LLMs: The Impact of Modern Tuning and Data Strategy

arXiv.org Artificial Intelligence

This paper introduces a multifaceted methodology for fine-tuning and evaluating large language models (LLMs) for specialized monetization tasks. The goal is to balance general language proficiency with domain-specific skills. The methodology has three main components: 1) Carefully blending in-domain and general-purpose data during fine-tuning to achieve an optimal balance between general and specialized capabilities; 2) Designing a comprehensive evaluation framework with 45 questions tailored to assess performance on functionally relevant dimensions like reliability, consistency, and business impact; 3) Analyzing how model size and continual training influence metrics to guide efficient resource allocation during fine-tuning. The paper details the design, data collection, analytical techniques, and results validating the proposed frameworks. It aims to provide businesses and researchers with actionable insights on effectively adapting LLMs for specialized contexts. We also intend to make public the comprehensive evaluation framework, which includes the 45 tailored questions and their respective scoring guidelines, to foster transparency and collaboration in adapting LLMs for specialized tasks.
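
The first component, data blending, can be illustrated with a toy sampler. The 30/70 in-domain/general split below is an illustrative assumption, not a ratio reported in the paper.

```python
# Toy sketch of blending in-domain and general-purpose examples at a fixed
# ratio when building a fine-tuning mixture. The ratio is an assumption.
import random

def blend(in_domain, general, domain_frac=0.3, n=10, seed=0):
    rng = random.Random(seed)
    mixture = []
    for _ in range(n):
        pool = in_domain if rng.random() < domain_frac else general
        mixture.append(rng.choice(pool))
    return mixture

in_domain = [f"domain_ex_{i}" for i in range(100)]
general = [f"general_ex_{i}" for i in range(100)]
print(blend(in_domain, general)[:5])
```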


ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention

arXiv.org Artificial Intelligence

In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification (SV) that improves upon existing work by using Asymmetric Cross Attention (ACA) to replace temporal pooling. ACA is able to distill large, variable-length sequences into small, fixed-sized latents by attending a small query to large key and value matrices. In ACA-Net, we build a Multi-Layer Aggregation (MLA) block using ACA to generate fixed-sized identity vectors from variable-length inputs. Through global attention, ACA-Net acts as an efficient global feature extractor that adapts to temporal variability, unlike existing SV models that apply a fixed pooling function over the temporal dimension, which may obscure information about the signal's nonstationary temporal variability. Our experiments on WSJ0-1talker show that ACA-Net outperforms a strong baseline by a 5% relative improvement in EER using only 1/5 of the parameters.

[Figure 1: Overall architecture of ACA-Net. The model consists of a single 1x1 TDNN block, followed by the Multi-Layer Aggregation (MLA) block.]

Index Terms: Speaker Verification, Asymmetric Cross Attention
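
The asymmetric shape of the attention is the core idea: a handful of learned queries against arbitrarily many keys and values. The single-head sketch below uses illustrative dimensions and plain random projections; the real model stacks ACA inside the MLA block after the 1x1 TDNN.

```python
# Single-head sketch of Asymmetric Cross Attention: a small set of learned
# queries attends to keys/values from a long, variable-length input,
# distilling it into a fixed-size latent. Dimensions are assumptions.
import torch
import torch.nn.functional as F

d, n_latent = 64, 4
queries = torch.nn.Parameter(torch.randn(n_latent, d))   # learned, fixed-size
wk, wv = torch.randn(d, d), torch.randn(d, d)

def aca(x):
    """x: (seq, d) variable-length features -> (n_latent, d) fixed latent."""
    k, v = x @ wk, x @ wv
    scores = queries @ k.T / d ** 0.5        # (n_latent, seq): asymmetric
    return F.softmax(scores, dim=-1) @ v     # pooled identity representation

for seq_len in (50, 200):                    # output size ignores input length
    print(aca(torch.randn(seq_len, d)).shape)
```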