SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes

Alex, Tony, Ahmed, Sara, Mustafa, Armin, Awais, Muhammad, Jackson, Philip JB

Jun-17-2025–arXiv.org Artificial Intelligence

Self-supervised pre-trained audio networks have seen widespread adoption in real-world systems, particularly in multi-modal large language models. These networks are often employed in a frozen state, under the assumption that the self-supervised pre-training has sufficiently equipped them to handle real-world audio. However, a critical question remains: how well do these models actually perform in real-world conditions, where audio is typically polyphonic and complex, involving multiple overlapping sound sources? Current audio self-supervised learning (SSL) methods are often benchmarked on datasets predominantly featuring monophonic audio, such as environmental sounds, and speech. As a result, the ability of SSL models to generalize to polyphonic audio, a common characteristic in natural scenarios, remains underexplored. This limitation raises concerns about the practical robustness of SSL models in more realistic audio settings. To address this gap, we introduce S elf-S upervised L earning from A udio M ixtures (SSLAM), a novel direction in audio SSL research, designed to improve the model's ability to learn from polyphonic data while maintaining strong performance on monophonic data. We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets which are predominantly monophonic and conduct a comprehensive comparative analysis against state-of-the-art (SOT A) methods using a range of high-quality, publicly available polyphonic datasets. SS-LAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9% improvement on the AudioSet-2M(AS-2M), reaching a mean average precision (mAP) of 50.2. These results demonstrate SSLAM's effectiveness in both polyphonic and monophonic soundscapes, significantly enhancing the performance of audio SSL models. Code and pre-trained models are available at https://github.com/ta012/SSLAM .

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Jun-17-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report > New Finding (0.48)

Industry:
- Education (0.69)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.66)
  - Machine Learning
    - Neural Networks (0.68)
    - Performance Analysis (0.67)
    - Inductive Learning (0.67)
    - Statistical Learning (0.66)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found