Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings
Wisnu, Dyah A. M. G., Zezario, Ryandhimas E., Rini, Stefano, Wang, Hsin-Min, Tsao, Yu
arXiv.org Artificial Intelligence
We present a system for automatic multi-axis perceptual quality prediction of generative audio, developed for Track 2 of the AudioMOS Challenge 2025. The task is to predict four Audio Aesthetic Scores (Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness) for audio generated by text-to-speech (TTS), text-to-audio (TTA), and text-to-music (TTM) systems. A main challenge is the domain shift between natural training data and synthetic evaluation data. To address this, we combine BEATs, a pretrained transformer-based audio representation model, with a multi-branch long short-term memory (LSTM) predictor and use a triplet loss with buffer-based sampling to structure the embedding space by perceptual similarity. Our results show that this improves embedding discriminability and generalization, enabling domain-robust audio quality assessment without synthetic training data.
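The triplet loss with buffer-based sampling described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the score thresholds (`pos_gap`, `neg_gap`), and the margin value are assumptions. The idea is to keep a buffer of (embedding, perceptual score) pairs, sample anchors and positives with similar scores and negatives with dissimilar ones, and penalize embeddings whose anchor-negative distance is not larger than the anchor-positive distance by at least a margin.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss on squared Euclidean distances:
    max(0, d(a, p) - d(a, n) + margin)."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + margin)

def sample_triplet(buffer, rng, pos_gap=0.5, neg_gap=1.0, max_tries=100):
    """Draw (anchor, positive, negative) embeddings from a buffer of
    (embedding, perceptual_score) pairs: a positive shares a similar
    score with the anchor (|diff| < pos_gap), a negative differs by
    at least neg_gap. Thresholds here are illustrative assumptions."""
    for _ in range(max_tries):
        ai = rng.integers(len(buffer))
        a_emb, a_score = buffer[ai]
        pos = [e for i, (e, s) in enumerate(buffer)
               if i != ai and abs(s - a_score) < pos_gap]
        neg = [e for e, s in buffer if abs(s - a_score) >= neg_gap]
        if pos and neg:
            return (a_emb,
                    pos[rng.integers(len(pos))],
                    neg[rng.integers(len(neg))])
    raise ValueError("buffer contains no valid triplet")
```

In training, the buffer would hold recent BEATs-derived embeddings together with their ground-truth aesthetic scores, so that the triplet term pulls perceptually similar clips together and pushes dissimilar ones apart in embedding space.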
Sep-4-2025