Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?
Qixin Deng, Bryan Pardo, Thrasyvoulos N. Pappas
arXiv.org Artificial Intelligence
Understanding and modeling the relationship between language and sound is critical for applications such as music information retrieval, text-guided music generation, and audio captioning. Central to these tasks are joint language-audio embedding models, which map textual descriptions and auditory content into a shared embedding space. While multimodal embedding models such as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning language and audio, their correspondence to human perception of timbre, a multifaceted attribute encompassing qualities such as brightness, roughness, and warmth, remains underexplored. In this paper, we evaluate these three joint language-audio embedding models on their ability to capture perceptual dimensions of timbre. Our findings show that LAION-CLAP consistently provides the most reliable alignment with human-perceived timbre semantics across both instrumental sounds and audio effects.
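The core mechanism the abstract relies on, comparing text and audio in a shared embedding space, can be sketched with plain cosine similarity. The snippet below is a minimal illustration using synthetic vectors: the embeddings, descriptor strings, and dimensionality are all hypothetical stand-ins, since a real pipeline would obtain the vectors from a model such as LAION-CLAP.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 8  # toy dimensionality; real CLAP-style embeddings are much larger

# Hypothetical audio embedding for a clip with a "bright" timbre.
audio_emb = rng.normal(size=dim)

# Hypothetical text embeddings for timbre descriptors. For illustration,
# "a bright sound" is constructed to lie near the audio embedding and
# "a warm sound" to lie far from it.
text_embs = {
    "a bright sound": audio_emb + 0.1 * rng.normal(size=dim),
    "a warm sound": -audio_emb + 0.1 * rng.normal(size=dim),
}

# Rank descriptors by similarity to the audio clip, as a retrieval system
# over a joint embedding space would.
ranking = sorted(
    text_embs,
    key=lambda t: cosine_similarity(audio_emb, text_embs[t]),
    reverse=True,
)
print(ranking[0])  # the descriptor nearest the audio embedding
```

Evaluating perceptual alignment then amounts to checking whether such similarity rankings agree with human judgments of the same timbre attributes.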
Oct-17-2025