Spatio-Temporal Attention Pooling for Audio Scene Classification

Phan, Huy, Chén, Oliver Y., Pham, Lam, Koch, Philipp, De Vos, Maarten, McLoughlin, Ian, Mertins, Alfred

arXiv.org Machine Learning 

Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features. Afterwards, a two-dimensional attention mask is formed via the outer product of the spatial and temporal attention vectors.

Given the rich content of acoustic scenes, they typically contain a lot of irrelevant and redundant information. This fact naturally gives rise to the question of how to encourage a deep learning model to automatically discover and focus on discriminative patterns and suppress irrelevant ones from the acoustic scenes for better classification. We seek to address that question in this work using an attention mechanism [15]. To this end, we propose a spatio-temporal attention pooling layer in combination with a convolutional recurrent neural network (CRNN), inspired by their success in the audio event detection task [16, 17].
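The outer-product construction above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature map `H` and the scorer vectors `w_t` and `w_f` are random placeholders standing in for learned CRNN outputs and learnable attention parameters, and the pooling step simply weights the features by the 2D mask and sums over time.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, F = 10, 8                      # time steps x feature (spatial) dims

# Stand-in for the bidirectional recurrent layers' output features
H = rng.standard_normal((T, F))

# Placeholder "learnable" attention scorers (hypothetical, random here)
w_t = rng.standard_normal(F)      # scores each time step
w_f = rng.standard_normal(T)      # scores each feature dimension

a_t = softmax(H @ w_t)            # temporal attention weights, shape (T,)
a_f = softmax(H.T @ w_f)          # spatial attention weights, shape (F,)

# Two-dimensional attention mask via the outer product
A = np.outer(a_t, a_f)            # shape (T, F)

# Attention-weighted pooling: mask the features, then sum over time
pooled = (A * H).sum(axis=0)      # fixed-length vector, shape (F,)
```

The key point is that a rank-one mask `A = a_t a_f^T` lets the network emphasize discriminative time-frequency regions while downweighting irrelevant ones before pooling to a fixed-length representation.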
