MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, Emmanouil Benetos
arXiv.org Artificial Intelligence
Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of the textual and musical domains remains relatively unexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query response. MusiLingo employs a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with the frozen Vicuna-7B language model (an adaptation of LLaMA), bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Owing to the scarcity of high-quality music Q&A datasets, we created the Music Instruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs.
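The core alignment step described above can be sketched as a single linear map from the frozen audio encoder's feature space into the frozen LLM's embedding space. The following is a minimal illustration, not the authors' implementation; the dimensions (1024 for MERT frame features, 4096 for Vicuna-7B/LLaMA token embeddings) and the function names are assumptions for the sake of the sketch:

```python
import numpy as np

# Assumed dimensions for illustration: MERT frame embeddings (~1024-d)
# projected into the Vicuna-7B/LLaMA token-embedding space (4096-d).
MERT_DIM, LLM_DIM = 1024, 4096

rng = np.random.default_rng(0)
W = rng.standard_normal((MERT_DIM, LLM_DIM)) * 0.02  # projection weights (trainable)
b = np.zeros(LLM_DIM)                                # projection bias (trainable)

def project(mert_feats: np.ndarray) -> np.ndarray:
    """Map (frames, MERT_DIM) audio features to (frames, LLM_DIM) pseudo-tokens
    that can be prepended to the frozen LLM's text-embedding sequence."""
    return mert_feats @ W + b

frames = rng.standard_normal((250, MERT_DIM))  # dummy MERT output for one clip
music_tokens = project(frames)
print(music_tokens.shape)  # (250, 4096)
```

Because both MERT and Vicuna-7B stay frozen, only `W` and `b` (the projection) are updated during caption pre-training and instruction fine-tuning, which keeps the trainable parameter count very small.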
Oct-12-2023
- Country:
- North America > United States (0.68)
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Leisure & Entertainment (1.00)
- Media > Music (1.00)