Quality-aware Masked Diffusion Transformer for Enhanced Music Generation
Chang Li, Ruoyu Wang, Lijuan Liu, Jun Du, Yixuan Sun, Zilu Guo, Zhenrong Zhang, Yuan Jiang
In recent years, diffusion-based text-to-music (TTM) generation has gained prominence, offering a novel approach to synthesizing musical content from textual descriptions. Achieving high accuracy and diversity in this generation process requires extensive, high-quality data, which often constitutes only a fraction of available datasets. Within open-source datasets, the prevalence of issues such as mislabeling, weak labeling, unlabeled data, and low-quality music waveforms significantly hampers the development of music generation models. To overcome these challenges, we introduce a novel quality-aware masked diffusion transformer (QA-MDT) approach that enables generative models to discern the quality of input music waveforms during training. Building on the unique properties of musical signals, we adapt and implement an MDT model for the TTM task and further unveil its distinct capacity for quality control. Moreover, we address the issue of low-quality captions with a caption refinement data processing approach. Our demo page is available at https://qa-mdt.github.io/.
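The abstract does not spell out how quality awareness is injected, so the following is only a minimal sketch of one plausible reading: prepend a learned quality-score embedding to the token sequence of a masked diffusion transformer, so the backbone can associate training examples with their (pseudo-labeled) quality and be steered toward the highest quality bin at inference. All class, parameter, and variable names here (QualityAwareMDT, quality_bin, n_quality_bins, etc.) are hypothetical illustrations, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class QualityAwareMDT(nn.Module):
    """Hypothetical sketch: an MDT-style transformer whose input sequence
    is prefixed with a quality embedding, making generation quality a
    controllable condition rather than an unobserved data attribute."""

    def __init__(self, dim=256, n_layers=4, n_heads=4,
                 n_patches=64, n_quality_bins=5):
        super().__init__()
        self.patch_proj = nn.Linear(dim, dim)                 # latent patches -> tokens
        self.quality_emb = nn.Embedding(n_quality_bins, dim)  # discretized quality score
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, n_layers)
        self.head = nn.Linear(dim, dim)                       # predict denoised latents

    def forward(self, noisy_patches, quality_bin, mask_ratio=0.3):
        B, N, D = noisy_patches.shape
        tokens = self.patch_proj(noisy_patches)
        # Randomly replace a subset of patch tokens with a learnable mask
        # token (the masked-modeling ingredient of MDT-style training).
        keep = torch.rand(B, N, device=tokens.device) > mask_ratio
        tokens = torch.where(keep.unsqueeze(-1), tokens,
                             self.mask_token.expand(B, N, D))
        # Prepend the quality token so every layer sees the quality signal.
        q = self.quality_emb(quality_bin).unsqueeze(1)        # (B, 1, D)
        x = torch.cat([q, tokens], dim=1) + self.pos_emb[:, : N + 1]
        x = self.backbone(x)
        return self.head(x[:, 1:])                            # drop quality token

# Toy usage: train-time pass with per-example pseudo quality labels; at
# inference one would fix quality_bin to the top bin to request
# high-quality output.
model = QualityAwareMDT()
latents = torch.randn(2, 64, 256)   # (batch, patches, dim)
quality = torch.tensor([4, 1])      # e.g. high- vs. low-quality examples
pred = model(latents, quality)
print(pred.shape)                   # torch.Size([2, 64, 256])
```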
arXiv.org Artificial Intelligence
May-24-2024