Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos

Open in new window