Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
