Spatio-Temporal Attention Pooling for Audio Scene Classification

Phan, Huy, Chén, Oliver Y., Pham, Lam, Koch, Philipp, De Vos, Maarten, McLoughlin, Ian, Mertins, Alfred

arXiv.org Machine Learning 

Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features. Afterwards, a two-dimensional attention mask is formed via the outer product of the spatial and temporal attention vectors.

Given the rich content of acoustic scenes, they typically contain a lot of irrelevant and redundant information. This fact naturally gives rise to the question of how to encourage a deep learning model to automatically discover and focus on discriminative patterns and suppress irrelevant ones from the acoustic scenes for better classification. We seek to address that question in this work using an attention mechanism [15]. To this end, we propose a spatio-temporal attention pooling layer in combination with a convolutional recurrent neural network (CRNN), inspired by their success in the audio event detection task [16, 17].
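The outer-product construction above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature map `H` and the scorer vectors `w_t` and `w_f` are random placeholders standing in for learned CRNN outputs and learnable attention parameters, and the pooling step simply weights the features by the 2D mask and sums over time.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, F = 10, 8                      # time steps x feature (spatial) dims

# Stand-in for the bidirectional recurrent layers' output features
H = rng.standard_normal((T, F))

# Placeholder "learnable" attention scorers (hypothetical, random here)
w_t = rng.standard_normal(F)      # scores each time step
w_f = rng.standard_normal(T)      # scores each feature dimension

a_t = softmax(H @ w_t)            # temporal attention weights, shape (T,)
a_f = softmax(H.T @ w_f)          # spatial attention weights, shape (F,)

# Two-dimensional attention mask via the outer product
A = np.outer(a_t, a_f)            # shape (T, F)

# Attention-weighted pooling: mask the features, then sum over time
pooled = (A * H).sum(axis=0)      # fixed-length vector, shape (F,)
```

The key point is that a rank-one mask `A = a_t a_f^T` lets the network emphasize discriminative time-frequency regions while downweighting irrelevant ones before pooling to a fixed-length representation.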
