MAST: Multiscale Audio Spectrogram Transformers

Sreyan Ghosh, Ashish Seth, S. Umesh, Dinesh Manocha

arXiv.org Artificial Intelligence 

ABSTRACT

We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST) [1]. Given an input audio spectrogram, we first patchify and project it into an initial temporal resolution and embedding dimension, after which the multiple stages in MAST progressively expand the embedding dimension while reducing the temporal resolution of the input. MAST significantly outperforms AST by an average accuracy of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark [2], achieving state-of-the-art results on keyword spotting in Speech Commands. Additionally, our proposed SS-MAST achieves an absolute average improvement of 2.6% over the previously proposed SSAST [3].

1. INTRODUCTION

Natural signals such as speech and audio are hierarchically structured across various timescales, spanning tens (e.g., phonemes) to hundreds (e.g., words) of milliseconds.

To confirm our hypothesis on hierarchically structured natural signals, we highlight a key architectural design choice common across the best-performing CNN-based architectures for audio classification in the literature. With a spectrogram as input, [...] This design choice for pure-CNN models allows them to hierarchically learn simple low-level acoustic features in the lower stages, aided by high temporal and low embedding dimensions, up to complex high-level acoustic features in the higher stages, aided by low temporal and high embedding dimensions.
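To make the pyramid design concrete, the PyTorch sketch below implements the stage pattern described above: the spectrogram is patchified into an initial token sequence, and each stage then halves the temporal resolution while doubling the embedding dimension. This is an illustrative reconstruction, not the authors' implementation; the patch size, stage depths, head counts, the strided convolution standing in for MAST's pooling operation, and the 35-class head (the number of Speech Commands keywords) are all assumptions made for the example.

import torch
import torch.nn as nn


class Stage(nn.Module):
    """One multiscale stage: transformer blocks at a fixed resolution, then a
    transition that halves the token count and doubles the embedding dim."""

    def __init__(self, dim, depth, num_heads):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Strided 1-D conv stands in for MAST-style pooling: stride 2 halves
        # the temporal resolution while projecting dim -> 2 * dim.
        self.transition = nn.Conv1d(dim, 2 * dim, kernel_size=3, stride=2,
                                    padding=1)

    def forward(self, x):                      # x: (batch, tokens, dim)
        x = self.blocks(x)
        return self.transition(x.transpose(1, 2)).transpose(1, 2)


class MultiscaleSpectrogramEncoder(nn.Module):
    """Patchify a spectrogram into temporal tokens, then run a pyramid of
    stages: early stages model many tokens with small embeddings (low-level
    acoustics), later stages few tokens with large embeddings (high-level)."""

    def __init__(self, n_mels=128, patch=16, dim=96,
                 depths=(2, 2, 6), heads=(3, 6, 12), num_classes=35):
        super().__init__()
        # Full-height patches give an initial, purely temporal token sequence.
        self.patchify = nn.Conv2d(1, dim, kernel_size=(n_mels, patch),
                                  stride=(n_mels, patch))
        self.stages = nn.ModuleList()
        for depth, num_heads in zip(depths, heads):
            self.stages.append(Stage(dim, depth, num_heads))
            dim *= 2                           # embedding dim doubles per stage
        self.head = nn.Linear(dim, num_classes)

    def forward(self, spec):                   # spec: (batch, 1, n_mels, time)
        x = self.patchify(spec)                # (batch, dim, 1, time // patch)
        x = x.squeeze(2).transpose(1, 2)       # (batch, tokens, dim)
        for stage in self.stages:
            x = stage(x)                       # tokens halve, dim doubles
        return self.head(x.mean(dim=1))        # mean-pool tokens, classify


model = MultiscaleSpectrogramEncoder()
logits = model(torch.randn(2, 1, 128, 1024))   # batch of 2 log-mel spectrograms
print(logits.shape)                            # torch.Size([2, 35])

With these assumed settings the token count shrinks 64 -> 32 -> 16 -> 8 across stages while the embedding grows 96 -> 192 -> 384 -> 768, mirroring the low-level-to-high-level feature hierarchy that the introduction attributes to the best-performing pure-CNN models.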
