On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot Learning

Tavares, Tiago, Ayres, Fabio, Wang, Zhepei, Smaragdis, Paris

Aug-23-2024–arXiv.org Artificial Intelligence

Recent advances in audio-text cross-modal contrastive learning have shown its potential for zero-shot learning. One way to achieve this is to project item embeddings from pre-trained backbone neural networks into a cross-modal space in which item similarity can be calculated in either domain. This process relies on strong unimodal pre-training of the backbone networks and on a data-intensive training task for the projectors. Both steps can be biased by unintentional data leakage, which can arise from using supervised learning in pre-training or from inadvertently training the cross-modal projection using labels from the zero-shot learning evaluation. In this study, we show that a significant part of the measured zero-shot learning accuracy is due to strengths inherited from the audio and text backbones; that is, these strengths are not learned in the cross-modal domain and are not transferred from one modality to another.
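
The projection-and-similarity setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear projectors, the toy embedding values, and the names `project`, `cosine`, and `zero_shot_classify` are all assumptions made for illustration. In a real system the embeddings would come from pre-trained audio and text backbones, and the projectors would be trained with a contrastive objective.

```python
import math


def project(matrix, vector):
    """Apply a linear projector (given as a list of rows) to an embedding."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_classify(audio_emb, label_embs, audio_proj, text_proj):
    """Pick the label whose projected text embedding is closest to the
    projected audio embedding in the shared cross-modal space."""
    z_audio = project(audio_proj, audio_emb)
    scores = {
        label: cosine(z_audio, project(text_proj, emb))
        for label, emb in label_embs.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy values standing in for backbone outputs and trained projectors.
AUDIO_PROJ = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3-d audio space -> 2-d shared space
TEXT_PROJ = [[1.0, 0.0], [0.0, 1.0]]             # 2-d text space -> 2-d shared space
LABEL_EMBS = {"dog bark": [1.0, 0.1], "siren": [0.1, 1.0]}

audio_emb = [0.9, 0.2, 0.5]  # made-up audio backbone embedding
predicted, scores = zero_shot_classify(audio_emb, LABEL_EMBS, AUDIO_PROJ, TEXT_PROJ)
print(predicted)
```

In this sketch the class labels are never used to train anything, so the prediction depends only on how well the two embedding spaces and their projectors line up. That is why separability already present in the backbones can inflate the measured zero-shot accuracy.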

accuracy, data leakage, similarity

