Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning
Ludovic Tuncay, Etienne Labbé, Emmanouil Benetos, Thomas Pellegrini
–arXiv.org Artificial Intelligence
Self-Supervised Learning (SSL) has revolutionized representation learning for speech and audio, enabling models to learn from unlabeled data and excel in diverse downstream tasks [1, 2, 3, 4]. Early SSL approaches for audio, such as contrastive predictive coding and wav2vec 2.0, learned latent speech representations by masking the input and solving a contrastive task over latent codes [5]. Follow-up methods like HuBERT [1] introduced offline clustering to generate pseudo-labels for masked audio segments, and WavLM [6] applied data augmentation and denoising to improve robustness in speech representation learning. More recently, latent prediction approaches have gained traction: data2vec [7] and its efficient successor data2vec 2.0 [8] employ a teacher-student framework to predict contextualized latent representations of the input, achieving strong results across vision, speech, and language tasks. In the audio domain, Niizumi et al. introduced Masked Modeling Duo (M2D) [4], which uses two networks (an online encoder and a momentum encoder) to predict masked patch embeddings and attained state-of-the-art results on numerous audio benchmarks. In computer vision, a new paradigm called Joint-Embedding Predictive Architecture (JEPA) [9, 10, 11] has been proposed to predict hidden content in a high-level latent space instead of pixel space.
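To make the latent-prediction idea concrete, below is a minimal, hypothetical PyTorch sketch of one JEPA-style training step: an online encoder sees only the visible patches, a small predictor regresses the latent embeddings of the masked patches, and the regression targets come from an exponential-moving-average (EMA) copy of the encoder. All names, shapes, and hyperparameters here (PatchEncoder, jepa_step, momentum=0.996, the 60% mask ratio) are illustrative assumptions, not the Audio-JEPA paper's actual architecture or settings.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Stand-in for a transformer encoder over audio spectrogram patches."""
    def __init__(self, dim_in: int, dim_h: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_h), nn.GELU(),
                                 nn.Linear(dim_h, dim_h))

    def forward(self, x):      # x: (B, N, dim_in)
        return self.net(x)     # -> (B, N, dim_h)

def jepa_step(online, target, predictor, patches, mask, opt, momentum=0.996):
    """One illustrative JEPA-style step; mask is True where patches are hidden."""
    # Online branch: zero out masked patches, encode, predict all slots.
    context = online(patches * (~mask).unsqueeze(-1))
    pred = predictor(context)
    # Target branch: EMA encoder embeds the full input; no gradients flow.
    with torch.no_grad():
        tgt = target(patches)
    # Regression loss in latent space, restricted to masked positions.
    loss = nn.functional.smooth_l1_loss(pred[mask], tgt[mask])
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update keeps the target encoder a slow-moving copy of the online one.
    with torch.no_grad():
        for pt, po in zip(target.parameters(), online.parameters()):
            pt.mul_(momentum).add_(po, alpha=1.0 - momentum)
    return loss.item()

# Usage on random data, just to show the shapes involved.
B, N, D, H = 4, 64, 128, 256
online = PatchEncoder(D, H)
target = PatchEncoder(D, H)
target.load_state_dict(online.state_dict())
for p in target.parameters():
    p.requires_grad_(False)
predictor = nn.Linear(H, H)
opt = torch.optim.AdamW(list(online.parameters()) + list(predictor.parameters()), lr=1e-4)
patches = torch.randn(B, N, D)            # dummy spectrogram patch embeddings
mask = torch.rand(B, N) < 0.6             # hide roughly 60% of the patches
print(jepa_step(online, target, predictor, patches, mask, opt))
```

Computing the loss only at masked positions, and in latent space rather than over waveforms or spectrogram pixels, is what distinguishes the JEPA/M2D family of objectives from masked-autoencoder-style reconstruction.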
Jul-8-2025