AITopics | Jiang, Yidi

Collaborating Authors

Jiang, Yidi

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation

Zhang, Chong, Ma, Yukun, Chen, Qian, Wang, Wen, Zhao, Shengkui, Pan, Zexu, Wang, Hao, Ni, Chongjia, Nguyen, Trung Hieu, Zhou, Kun, Jiang, Yidi, Tan, Chaohong, Gao, Zhifu, Du, Zhihao, Ma, Bin

arXiv.org Artificial IntelligenceFeb-28-2025

We introduce InspireMusic, a framework integrated super resolution and large language model for high-fidelity long-form music generation. A unified framework generates high-fidelity music, songs, and audio, which incorporates an autoregressive transformer with a super-resolution flow-matching model. This framework enables the controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches, as we utilize an audio tokenizer with one codebook that contains richer semantic information, thereby reducing training costs and enhancing efficiency. This combination enables us to achieve high-quality audio generation with long-form coherence of up to $8$ minutes. Then, an autoregressive transformer model based on Qwen 2.5 predicts audio tokens. Next, we employ a super-resolution flow-matching model to generate high-sampling rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model has a comparable performance to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.

inspiremusic-1, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2503.00084

Genre: Research Report (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

WavChat: A Survey of Spoken Dialogue Models

Ji, Shengpeng, Chen, Yifu, Fang, Minghui, Zuo, Jialong, Lu, Jingyu, Wang, Hanting, Jiang, Ziyue, Zhou, Long, Liu, Shujie, Cheng, Xize, Yang, Xiaoda, Wang, Zehan, Yang, Qian, Li, Jian, Jiang, Yidi, He, Jingzhen, Chu, Yunfei, Xu, Jin, Zhao, Zhou

arXiv.org Artificial IntelligenceNov-26-2024

Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking capability. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, we have first compiled existing spoken dialogue systems in the chronological order and categorized them into the cascaded and end-to-end paradigms. We then provide an in-depth overview of the core technologies in spoken dialogue models, covering aspects such as speech representation, training paradigm, streaming, duplex, and interaction capabilities. Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of relevant datasets, evaluation metrics, and benchmarks from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems. The related material is available at https://github.com/jishengpeng/WavChat.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2411.13577

Country: Asia > Japan > Honshū (0.27)

Genre: Overview (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)
Information Technology (0.92)
Education (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge

Ge, Meng, Peng, Yizhou, Jiang, Yidi, Lin, Jingru, Ao, Junyi, Yildirim, Mehmet Sinan, Wang, Shuai, Li, Haizhou, Feng, Mengling

arXiv.org Artificial IntelligenceDec-26-2023

This paper summarizes our team's efforts in both tracks of the ICMC-ASR Challenge for in-car multi-channel automatic speech recognition. Our submitted systems for ICMC-ASR Challenge include the multi-channel front-end enhancement and diarization, training data augmentation, speech recognition modeling with multi-channel branches. Tested on the offical Eval1 and Eval2 set, our best system achieves a relative 34.3% improvement in CER and 56.5% improvement in cpCER, compared to the offical baseline system.

artificial intelligence, machine learning, pipeline, (12 more...)

arXiv.org Artificial Intelligence

2312.16002

Country: Asia > China (0.50)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.91)

Add feedback

Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation

Du, Jiawei, Jiang, Yidi, Tan, Vincent Y. F., Zhou, Joey Tianyi, Li, Haizhou

arXiv.org Artificial IntelligenceMar-25-2023

Model-based deep learning has achieved astounding successes due in part to the availability of large-scale real-world data. However, processing such massive amounts of data comes at a considerable cost in terms of computations, storage, training and the search for good neural architectures. Dataset distillation has thus recently come to the fore. This paradigm involves distilling information from large real-world datasets into tiny and compact synthetic datasets such that processing the latter ideally yields similar performances as the former. State-of-the-art methods primarily rely on learning the synthetic dataset by matching the gradients obtained during training between the real and synthetic data. However, these gradient-matching methods suffer from the so-called accumulated trajectory error caused by the discrepancy between the distillation and subsequent evaluation. To mitigate the adverse impact of this accumulated trajectory error, we propose a novel approach that encourages the optimization algorithm to seek a flat trajectory. We show that the weights trained on synthetic data are robust against the accumulated errors perturbations with the regularization towards the flat trajectory. Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7% on a subset of images of the ImageNet dataset with higher resolution images. We also validate the effectiveness and generalizability of our method with datasets of different resolutions and demonstrate its applicability to neural architecture search. Code is available at https://github.com/AngusDujw/FTD-distillation.

artificial intelligence, deep learning, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2211.11004

Country:

Asia > Singapore (0.28)
Asia > China (0.28)

Genre: Research Report > Promising Solution (0.86)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback