WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction

Emon, Jakaria Islam, Alam, Kazi Tamanna, Salek, Md. Abu

Jun-9-2025–arXiv.org Artificial Intelligence

Mean Opinion Score (MOS) prediction for text-to-music systems requires evaluating both overall musical quality (OMQ) and text-prompt alignment (TA). This paper introduces WhisQ, a multimodal architecture that addresses this dual-assessment challenge through sequence-level co-attention and optimal transport regularization. WhisQ employs the pretrained Whisper-Base model for temporal audio encoding and Qwen-3, a 0.6B Small Language Model (SLM), for text encoding; both encoders preserve sequence structure to allow fine-grained cross-modal modeling. The architecture features specialized prediction pathways: OMQ is predicted from pooled audio embeddings, while TA leverages bidirectional sequence co-attention between audio and text. A Sinkhorn optimal transport loss further enforces semantic alignment in the shared embedding space. On the MusicEval Track-1 dataset, WhisQ achieves substantial improvements over the baseline: a 7% improvement in Spearman correlation for OMQ and 14% for TA. Ablation studies reveal that optimal transport regularization provides the largest performance gain (10% SRCC improvement), demonstrating the importance of explicit cross-modal alignment for text-to-music evaluation.
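
The abstract does not spell out the form of the Sinkhorn optimal transport loss, so the following is only a minimal sketch of a generic entropy-regularised Sinkhorn loss between a sequence of audio embeddings and a sequence of text embeddings. The cosine cost, the uniform marginals, the values of `eps` and `n_iters`, and the function names `cosine_cost` and `sinkhorn` are all assumptions made for illustration, not details taken from the paper. It uses plain Python lists rather than a tensor library to stay self-contained.

```python
import math


def cosine_cost(audio, text):
    """Pairwise cost matrix: 1 - cosine similarity (assumed cost, not from the paper)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v)) or 1.0

    return [
        [1.0 - sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v)) for v in text]
        for u in audio
    ]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised OT with uniform marginals; returns (plan, loss)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over audio frames (assumption)
    b = [1.0 / m] * m  # uniform mass over text tokens (assumption)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    # Transport cost under the plan; this scalar would act as the regulariser.
    loss = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, loss


# Toy embeddings: three audio frames and two text tokens in a 2-D shared space.
audio_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_seq = [[1.0, 0.1], [0.1, 1.0]]
plan, loss = sinkhorn(cosine_cost(audio_seq, text_seq))
print(f"OT alignment loss: {loss:.4f}")
```

In a setup like this, the loss is small when audio frames and text tokens point in similar directions in the shared space and grows as the two sequences drift apart, which is the sense in which such a term can encourage cross-modal alignment.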

artificial intelligence, machine learning, natural language

Country:
- Asia > Japan > Hokkaidō (0.15)

Genre:
- Research Report (0.50)

Industry:
- Media > Music (0.47)
- Leisure & Entertainment (0.47)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.91)
  - Speech (0.69)
  - Machine Learning > Neural Networks (0.46)
