USAD: Universal Speech and Audio Representation via Distillation
Chang, Heng-Jui, Bhati, Saurabhchand, Glass, James, Liu, Alexander H.
arXiv.org Artificial Intelligence
Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD achieves competitive performance across various benchmarks and datasets, including frame- and instance-level speech processing tasks, audio tagging, and sound classification, reaching near state-of-the-art results with a single encoder on the SUPERB and HEAR benchmarks.

In recent years, self-supervised learning (SSL) methods, learning frameworks that utilize unlabeled data without explicit supervision, have significantly advanced representation learning for audio processing. Speech SSL models like wav2vec 2.0 [1], HuBERT [2], and WavLM [3] have become the foundation of many applications, including automatic speech recognition (ASR), speaker identification, and phoneme classification. In parallel, SSL approaches developed for audio event classification and music understanding, such as SSAST [4], BEATs [5], and MERT [6], have proven effective on non-speech tasks. In practice, the use of audio representations has extended beyond simple downstream tasks.
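The layer-to-layer distillation described above can be sketched as follows. This is a minimal illustration, assuming the student's hidden features at selected layers are trained to match paired layers from two domain-specific teachers (a speech teacher and a general-audio teacher) under a plain mean-squared-error objective; the function names, the pairing scheme, and the MSE loss are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of layer-to-layer distillation: each student layer's
# features are pulled toward the corresponding layer of each teacher.
# Plain Python lists stand in for feature tensors.

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def layer_to_layer_loss(student_layers, teacher_layers):
    """Sum of per-layer MSE between paired student/teacher features."""
    assert len(student_layers) == len(teacher_layers)
    return sum(mse(s, t) for s, t in zip(student_layers, teacher_layers))

# Toy example: 2 layers, 3-dimensional features per layer.
student        = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
speech_teacher = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
audio_teacher  = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

# The single student distills from both domain-specific teachers jointly,
# so one encoder absorbs both speech and non-speech representations.
loss = (layer_to_layer_loss(student, speech_teacher)
        + layer_to_layer_loss(student, audio_teacher))
```

In practice the distillation targets would be transformer hidden states and the loss would be minimized by gradient descent over a large mixed-domain audio corpus; the toy vectors here only show how the per-layer terms combine into one training objective.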
Aug-19-2025