CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining

Tristan Tsoi, Jiajun Deng, Yaolong Ju, Benno Weck, Holger Kirchhoff, Simon Lui

arXiv.org Artificial Intelligence 

Music similarity retrieval is fundamental for managing and exploring relevant content from large collections on streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach that combines online scraping with LLM-based prompting, where carefully designed prompts leverage LLMs' comprehensive music knowledge to generate contextually rich descriptions. Extensive experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks, validated through objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform.

Music similarity retrieval plays an important role in many music information retrieval (MIR) tasks, such as music recommendation [1], personalized playlist generation [2], and background music replacement in video editing [3], [4]. As digital music collections rapidly expand within streaming platforms, accurately identifying similarities between musical pieces has become critical for managing and exploring relevant content from such large collections efficiently.
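Although the paper's exact training objective is not reproduced in this excerpt, the cross-modal alignment it describes can be illustrated with a minimal sketch: a symmetric InfoNCE-style contrastive loss that pulls each track's audio embedding toward the embedding of its paired text description and pushes it away from the other descriptions in the batch. The function name, embedding dimensionality, and temperature value below are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: tensors of shape (batch, dim), where row i of each
    tensor corresponds to the same track and its text description.
    """
    # L2-normalize both modalities so dot products are cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: row i compares audio i against every text.
    logits = audio_emb @ text_emb.t() / temperature

    # The matching text for audio i sits on the diagonal.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Average the audio-to-text and text-to-audio cross-entropy terms.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

# Example: a batch of 8 tracks with 512-dimensional audio and text embeddings.
if __name__ == "__main__":
    audio = torch.randn(8, 512)
    text = torch.randn(8, 512)
    print(cross_modal_contrastive_loss(audio, text))
```

Under this kind of objective, similarity between two tracks can later be scored directly in the shared embedding space (e.g., by cosine similarity of their audio embeddings), which is what enables text descriptions to guide music-to-music retrieval.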