Weakly-supervised Automated Audio Captioning via text only training

Kouzelis, Theodoros; Katsouros, Vassilis

arXiv.org Artificial Intelligence 

In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of paired audio and captions. Motivated by the recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP.

Despite considerable effort, the data scarcity issue in audio captioning persists. The common datasets in AAC, AudioCaps and Clotho, together contain 50k captions for training, whereas 400k captions are provided in COCO caption [8] for image captioning. Kim et al. [9] observe that, due to the limited data, prior works design decoders with shallow layers that fail to learn generalized language expressivity and are fitted to the small-scale target dataset. Due to this issue, their performance radically decreases when tested on out-of-domain data. Motivated by these limitations, we present an approach to AAC that only requires a pre-trained CLAP model and text data.
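The mechanism described above (condition a caption decoder on CLAP text embeddings during training, then swap in the CLAP audio embedding at inference, since both live in a shared space) can be illustrated with a short sketch. The code below is a minimal illustration under assumptions, not the authors' implementation: the PrefixCaptionDecoder architecture, the clap_text_encoder stand-in for the frozen CLAP text tower, and the Gaussian noise injected to bridge the audio-text modality gap (a strategy popularized by text-only captioning work such as CapDec) are all hypothetical choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixCaptionDecoder(nn.Module):
    # Hypothetical decoder: projects a CLAP embedding into a short prefix
    # that conditions a small causal Transformer over caption tokens.
    def __init__(self, embed_dim=512, vocab_size=10000, d_model=256, prefix_len=4):
        super().__init__()
        self.prefix_len = prefix_len
        self.prefix_proj = nn.Linear(embed_dim, prefix_len * d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, clap_emb, token_ids):
        B = token_ids.size(0)
        prefix = self.prefix_proj(clap_emb).view(B, self.prefix_len, -1)
        x = torch.cat([prefix, self.token_emb(token_ids)], dim=1)
        T = x.size(1)  # causal mask: True = position is not attended to
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.blocks(x, mask=mask)
        return self.lm_head(h[:, self.prefix_len:])  # next-token logits

def text_only_train_step(model, clap_text_encoder, token_ids, noise_std=0.015):
    # Training sees no audio: captions are embedded by the frozen CLAP
    # text tower and perturbed with Gaussian noise so the decoder becomes
    # robust to the offset between text and audio embeddings (assumption).
    with torch.no_grad():
        e = F.normalize(clap_text_encoder(token_ids), dim=-1)
    e = e + noise_std * torch.randn_like(e)
    logits = model(e, token_ids[:, :-1])  # teacher forcing
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           token_ids[:, 1:].reshape(-1))

# Smoke test with a dummy text tower standing in for CLAP.
dummy_encoder = lambda ids: torch.randn(ids.size(0), 512)
model = PrefixCaptionDecoder()
ids = torch.randint(0, 10000, (2, 12))
loss = text_only_train_step(model, dummy_encoder, ids)
loss.backward()

# At inference, the only change is the conditioning vector: feed the
# normalized CLAP *audio* embedding of the clip to the same decoder and
# sample caption tokens autoregressively.

Prefix conditioning here follows the ClipCap-style mapping of a single contrastive embedding into decoder tokens; it is one plausible choice, chosen because it keeps the frozen CLAP towers untouched and trains only the lightweight decoder.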
